From: Linus Torvalds
On Wed, Jun 30, 2010 at 5:40 AM, Nick Piggin <npiggin(a)suse.de> wrote:
>>
>> That's a pretty big ouch. Why does RCU freeing of inodes cause that
>> much regression? The RCU freeing is out of line, so where does the big
>> impact come from?
>
> That comes mostly from inability to reuse the cache-hot inode structure,
> and the cost to go over the deferred RCU list and free them after they
> get cache cold.

I do wonder if this isn't a big design bug.

Most of the time with RCU, we don't need to wait to actually do the
_freeing_ of the individual data structure, we only need to make sure
that the data structure remains of the same _type_. IOW, we can free
it (and re-use it), but the backing storage cannot be released to the
page cache. That's what SLAB_DESTROY_BY_RCU should give us.

Is that not possible in this situation? Do we really need to keep the
inode _identity_ around for RCU?

If you use just SLAB_DESTROY_BY_RCU, then inode re-use remains, and
cache behavior would be much improved. The usual requirement for
SLAB_DESTROY_BY_RCU is that you only touch a lock (and perhaps
re-validate the identity) in the RCU-reader paths. Could that be made
to work?

Because that 27% drop really is pretty distressing.

That said, open (of the non-creating kind), close, and stat are
certainly more important than creating and freeing files. So as a
trade-off, it's probably the right thing to do. But if we can get all
the improvement _without_ that big downside, that would obviously be
better yet.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin
On Thu, Jul 01, 2010 at 10:35:35AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2010 at 5:40 AM, Nick Piggin <npiggin(a)suse.de> wrote:
> >>
> >> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> >> much regression? The RCU freeing is out of line, so where does the big
> >> impact come from?
> >
> > That comes mostly from inability to reuse the cache-hot inode structure,
> > and the cost to go over the deferred RCU list and free them after they
> > get cache cold.
>
> I do wonder if this isn't a big design bug.

It's possible, yes. Although a lot of that drop does come from
hitting RCU and overrunning the slab allocator queues. It was
closer to 10% when doing small numbers of creat/unlink loops.


> Most of the time with RCU, we don't need to wait to actually do the
> _freeing_ of the individual data structure, we only need to make sure
> that the data structure remains of the same _type_. IOW, we can free
> it (and re-use it), but the backing storage cannot be released to the
> page cache. That's what SLAB_DESTROY_BY_RCU should give us.
>
> Is that not possible in this situation? Do we really need to keep the
> inode _identity_ around for RCU?
>
> If you use just SLAB_DESTROY_BY_RCU, then inode re-use remains, and
> cache behavior would be much improved. The usual requirement for
> SLAB_DESTROY_BY_RCU is that you only touch a lock (and perhaps
> re-validate the identity) in the RCU-reader paths. Could that be made
> to work?

I definitely thought of that. I actually thought it would not
be possible with the store-free path walk patches, though, because
we need to check some inode properties (e.g. permission). So I was
thinking that the usual approach of taking a per-entry lock defeats
the whole purpose of store-free path walk.

But you've got me to think about it again, and it should be possible
to do just using the dentry seqlock. IOW, if the inode gets disconnected
from the dentry (and can then possibly get freed and reused), just
retry the lookup.

It may be a little tricky. I'll wait until the path-walk code is
more polished first.

>
> Because that 27% drop really is pretty distressing.
>
> That said, open (of the non-creating kind), close, and stat are
> certainly more important than creating and freeing files. So as a
> trade-off, it's probably the right thing to do. But if we can get all
> the improvement _without_ that big downside, that would obviously be
> better yet.

We actually have bigger regressions than that for other code
paths. The RCU freeing of files structs causes a similar regression,
about 20-30%, in open/close.

I actually have a (proper) patch to make that use DESTROY_BY_RCU
too. It does slow down fd lookup by a tiny bit, though (lock, load,
branch, increment, unlock versus a single atomic inc), but it's the
same number of atomic ops.

From: Paul E. McKenney
On Thu, Jul 01, 2010 at 10:35:35AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2010 at 5:40 AM, Nick Piggin <npiggin(a)suse.de> wrote:
> >>
> >> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> >> much regression? The RCU freeing is out of line, so where does the big
> >> impact come from?
> >
> > That comes mostly from inability to reuse the cache-hot inode structure,
> > and the cost to go over the deferred RCU list and free them after they
> > get cache cold.
>
> I do wonder if this isn't a big design bug.
>
> Most of the time with RCU, we don't need to wait to actually do the
> _freeing_ of the individual data structure, we only need to make sure
> that the data structure remains of the same _type_. IOW, we can free
> it (and re-use it), but the backing storage cannot be released to the
> page cache. That's what SLAB_DESTROY_BY_RCU should give us.
>
> Is that not possible in this situation? Do we really need to keep the
> inode _identity_ around for RCU?

In this case, the workload can be very update-heavy, so this type-safe
(vs. identity-safe) approach indeed makes a lot of sense. But if this
was a read-heavy situation (think SELinux or many areas in networking),
the read-side simplifications and speedups that often come with
identity safety would probably more than make up for the occasional
grace-period-induced cache miss.

So, as a -very- rough rule of thumb, when less than a few percent
of the accesses are updates, you most likely want identity safety.
If more than half of the accesses can be updates, you probably want
SLAB_DESTROY_BY_RCU-style type safety instead -- or maybe just straight
locking. If you are somewhere in between, pick one randomly; if
it works, go with it, otherwise try something else. ;-)

In this situation, a create/rename/delete workload would be quite update
heavy, so, as you say, SLAB_DESTROY_BY_RCU is well worth looking into.

Thanx, Paul

> If you use just SLAB_DESTROY_BY_RCU, then inode re-use remains, and
> cache behavior would be much improved. The usual requirement for
> SLAB_DESTROY_BY_RCU is that you only touch a lock (and perhaps
> re-validate the identity) in the RCU-reader paths. Could that be made
> to work?
>
> Because that 27% drop really is pretty distressing.
>
> That said, open (of the non-creating kind), close, and stat are
> certainly more important than creating and freeing files. So as a
> trade-off, it's probably the right thing to do. But if we can get all
> the improvement _without_ that big downside, that would obviously be
> better yet.
>
> Linus
From: Nick Piggin
On Fri, Jul 02, 2010 at 03:23:17AM +1000, Nick Piggin wrote:
> On Wed, Jun 30, 2010 at 10:40:49PM +1000, Nick Piggin wrote:
> > But actually it's not all for scalability. I have some follow-on patches
> > (that require RCU inodes, among other things) that actually improve
> > single-threaded performance significantly. The git diff workload IIRC was
> > several % improved from speeding up stat(2).
>
> I rewrote the store-free path walk patch that goes on top of this
> patchset (it's now much cleaner and more optimised, I'll post a patch
> soonish). It is quicker than I remembered.
>
> A single thread running stat(2) in a loop on a file "./file" has the
> following cost (on a 2s8c Barcelona):
>
> 2.6.35-rc3 595 ns/op
> patched 336 ns/op
>
> stat(2) takes 56% of the time with the patches. That's something like 13
> fewer atomic operations per syscall.
>
> What's that good for? A single-threaded, cached `git diff` on the Linux
> kernel tree takes just 81% of the time after the vfs patches (0.27s vs
> 0.33s).

At the other end of the scale, I tried dbench on ramfs on the little
32n64c Altix. Dbench actually has the statfs() call completely removed
from this workload -- statfs() is still a little problematic, and the
patched kernel's throughput is roughly halved with it included.

dbench procs 1 64
2.6.35-rc3 235MB/s 95MB/s ( 0.6% scaling)
patched 245MB/s 14870MB/s (94.8% scaling)

(note all these numbers are with store-free path walking patches on top
of the posted patchset -- dbench procs do path walking from common cwds
so it will never scale this well if we have to take refcounts on common
dentries)

Thanks,
Nick
