From: Christoph Hellwig on
If you actually want to get this work in, reposting the huge patchkit again and
again probably doesn't help. Start by prioritizing areas and working on small
sets to get them ready.

files_lock and vfsmount_lock seem like rather easy targets to start
with. But for files_lock I really want to see something to generalize
the tty special case. If you touch that area in detail, that wart needs
to go. Al didn't seem to like my variant very much, so he might have
a better idea for it - otherwise it really does make the VFS locking simple
by removing any tty interaction with the superblock files list. The
other suggestion would be to only add regular (maybe even just
writeable) files to the list. In addition to reducing the number of
list operations required, it will also make the tty code a lot easier.
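
Roughly, the suggestion amounts to something like the sketch below
(illustrative only, not mainline code; the helper names and the exact
condition are placeholders):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Stand-in for the global lock in fs/file_table.c being discussed. */
static DEFINE_SPINLOCK(files_lock);

/* Only writeable regular files are interesting to remount-ro style walks. */
static inline bool file_wants_sb_list(struct file *filp)
{
	return S_ISREG(filp->f_path.dentry->d_inode->i_mode) &&
	       (filp->f_mode & FMODE_WRITE);
}

static void sb_list_add_maybe(struct file *filp, struct super_block *sb)
{
	if (!file_wants_sb_list(filp))
		return;		/* tty and other special files never get listed */

	spin_lock(&files_lock);
	list_add(&filp->f_u.fu_list, &sb->s_files);
	spin_unlock(&files_lock);
}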

As for the other patches: I don't think the massive fine-grained
locking in the hash tables is a good idea. I would recommend deferring
them for now, and then looking into better data structures for these caches
instead of working around the inherent problems of global hash tables.

From: Nick Piggin on
On Fri, Jun 25, 2010 at 03:12:21AM -0400, Christoph Hellwig wrote:
> If you actually want to get this work in, reposting the huge patchkit again and
> again probably doesn't help. Start by prioritizing areas and working on small
> sets to get them ready.

Sure, but I haven't been posting the same thing (and hadn't posted it for a
long time). This posting simply had a lot of new material and improvements to
all the existing patches.

I didn't cc anyone in particular because it's only for interested
people to take a look at. As you saw last time, when I cc'ed Al I was
trying to get exactly those easier targets merged.


> files_lock and vfsmount_lock seem like rather easy targets to start
> with. But for files_lock I really want to see something to generalize
> the tty special case. If you touch that area in detail, that wart needs
> to go. Al didn't seem to like my variant very much, so he might have
> a better idea for it - otherwise it really does make the VFS locking simple
> by removing any tty interaction with the superblock files list.

Actually I didn't like it because the error handling in the tty code
was broken and difficult to fix properly. The concept was OK though.

But the fact is that today the tty code already "knows" that the VFS doesn't
need its files on the superblock list, so it takes them off and uses that
list_head privately. Currently it also uses files_lock to protect
that private usage. These are two independent problems. My patch fixes
the second, and anything that fixes the first also needs to fix the
second in exactly the same way.
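
In other words, the private usage just needs its own lock. Something along
these lines (an illustrative sketch, not the actual patch; the lock name and
helper are hypothetical):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/tty.h>

/* Hypothetical tty-local lock, taking over from the VFS files_lock. */
static DEFINE_SPINLOCK(tty_files_lock);

static void tty_attach_file(struct tty_struct *tty, struct file *filp)
{
	/*
	 * The file is no longer on the superblock s_files list, so the
	 * fu_list node is free for the tty layer to reuse here.
	 */
	spin_lock(&tty_files_lock);
	list_add(&filp->f_u.fu_list, &tty->tty_files);
	spin_unlock(&tty_files_lock);
}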


> The
> other suggestion would be to only add regular (maybe even just
> writeable) files to the list. In addition to reducing the number of
> list operations required, it will also make the tty code a lot easier.

This was my suggestion, yes. Either way is conceptually the same; this
one just avoids the memory allocation and error handling problems that
yours had.

But again, the locking change is still required, and it would look exactly
the same as my patch anyway.


> As for the other patches: I don't think the massive fine-grained
> locking in the hash tables is a good idea. I would recommend deferring
> them for now, and then looking into better data structures for these caches
> instead of working around the inherent problems of global hash tables.

I don't agree, actually. I don't think there is any downside to
fine-grained locking of the hash with bit spinlocks. Until I see one,
I will keep them.
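
Roughly, the idea is a lock bit embedded in each hash bucket head, so the
fine-grained locking adds no memory over the table itself (a simplified
sketch, not the actual patch; the structure and names are illustrative):

#include <linux/bit_spinlock.h>

#define SKETCH_HASH_BITS	10	/* table size illustrative */

struct sketch_bucket {
	unsigned long head;		/* chain pointer; bit 0 doubles as the lock */
};

static struct sketch_bucket sketch_hash[1 << SKETCH_HASH_BITS];

static void sketch_bucket_lock(struct sketch_bucket *b)
{
	bit_spin_lock(0, &b->head);	/* spins only on this bucket's bit */
}

static void sketch_bucket_unlock(struct sketch_bucket *b)
{
	__bit_spin_unlock(0, &b->head);
}

/* Entries hang off (head & ~1UL); lookups in other buckets never contend. */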

I agree that some other data structure may be better, but it should be
compared with the best possible hash implementation, which is a scalable
hash like this one.

Also, our big impending performance problem is SMP scalability, not hash
lookup, AFAIKS.

From: Dave Chinner on
On Thu, Jun 24, 2010 at 01:02:12PM +1000, npiggin(a)suse.de wrote:
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Can you put a git tree up somewhere?

> Update to vfs scalability patches:

.....

Now that I've had a look at the whole series, I'll make an overall
comment: I suspect that the locking is sufficiently complex that we
can count the number of people that will be able to debug it on one
hand. This patch set didn't just fall off the locking cliff, it
fell into a bottomless pit...

> Performance:
> Last time I was testing on a 32-node Altix, which could be considered outside the
> sweet spot of Linux performance targets (i.e. improvements there may not justify
> the complexity). So recently I've been testing with a tightly interconnected
> 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> system.

Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
because there is a demonstrated need...

> *** Single-thread microbenchmark (simple syscall loops, lower is better):
> Test Difference at 95.0% confidence (50 runs)
> open/close -6.07% +/- 1.075%
> creat/unlink 27.83% +/- 0.522%
> Open/close is a little faster, which should be due to one less atomic in the
> dput common case. Creat/unlink is significantly slower, which is due to RCU
> freeing inodes.

That's a pretty big ouch. Why does RCU freeing of inodes cause that
much regression? The RCU freeing is out of line, so where does the big
impact come from?

> *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> vanilla vfs
> real 0m4.911s 0m0.183s
> user 0m1.920s 0m1.610s
> sys 4m58.670s 0m5.770s
> After the vfs patches, throughput increases 26x; however, parallelism is limited
> by the test's spawn and exit phases. The sys time improvement is closer to
> 50x. vanilla is bottlenecked on dcache_lock.

So if we cherry pick patches out of the series, what is the bare
minimum set needed to obtain a result in this ballpark? Same for the
other tests?

> *** Reclaim
> I have not done much reclaim testing yet. It should be more scalable and lower
> latency due to a significant reduction in LRU lock interference with other
> critical sections in inode/dentry code, and because we have per-zone locks.
> Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> kswapd will operate on lists of node-local memory objects.

This means we no longer have any global LRUness to inode or dentry
reclaim, which is going to significantly change caching behaviour.
It's also got interesting corner cases like a workload running on a
single node with a dentry/icache working set larger than the VM
wants to hold on a single node.

We went through these sorts of problems with cpusets a few years
back, and the workaround for it was not to limit the slab cache to
the cpuset's nodes. Handling this sort of problem correctly seems
distinctly non-trivial, so I'm really very reluctant to move in this
direction without clear evidence that we have no other
alternative....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Nick Piggin on
On Wed, Jun 30, 2010 at 09:30:54PM +1000, Dave Chinner wrote:
> On Thu, Jun 24, 2010 at 01:02:12PM +1000, npiggin(a)suse.de wrote:
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/
>
> Can you put a git tree up somewhere?

I suppose I should. I'll try to set one up.


> > Update to vfs scalability patches:
>
> ....
>
> Now that I've had a look at the whole series, I'll make an overall
> comment: I suspect that the locking is sufficiently complex that we
> can count the number of people that will be able to debug it on one
> hand.

As opposed to everyone who understood it beforehand? :)


> This patch set didn't just fall off the locking cliff, it
> fell into a bottomless pit...

I actually think it's simpler in some ways. It has more locks, but a
lot of those protect small, well-defined data.

Filesystems have required surprisingly minimal changes (except
autofs4, but that's a fairly special case).


> > Performance:
> > Last time I was testing on a 32-node Altix, which could be considered outside the
> > sweet spot of Linux performance targets (i.e. improvements there may not justify
> > the complexity). So recently I've been testing with a tightly interconnected
> > 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> > system.
>
> Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
> because there is a demonstrated need...

People are complaining about vfs scalability already (at least Intel,
Google, IBM, and networking people). By the time people start shouting,
it's too late because it will take years to get the patches merged. I'm
not counting -rt people who have a bad time with global vfs locks.

You saw the "batched dput+iput" hacks that Google posted a couple of
years ago. Those were in the days of 4-core Core2 CPUs, long before
16-thread Nehalems that will scale well to 4/8 sockets at low cost.

At the high end, vaguely extrapolating from my numbers, a big POWER7 may
do under 100 open/close operations per second per hw thread. A big UV
probably under 10 per core.

But actually it's not all for scalability. I have some follow-on patches
(which require RCU inodes, among other things) that actually improve
single-threaded performance significantly. The git diff workload, IIRC,
improved by several percent from speeding up stat(2).


> > *** Single-thread microbenchmark (simple syscall loops, lower is better):
> > Test Difference at 95.0% confidence (50 runs)
> > open/close -6.07% +/- 1.075%
> > creat/unlink 27.83% +/- 0.522%
> > Open/close is a little faster, which should be due to one less atomic in the
> > dput common case. Creat/unlink is significantly slower, which is due to RCU
> > freeing inodes.
>
> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> much regression? The RCU freeing is out of line, so where does the big
> impact come from?

That comes mostly from the inability to reuse the cache-hot inode structure,
and from the cost of going over the deferred RCU list and freeing the inodes
after they have gone cache-cold.
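
For reference, the structural difference is roughly this (a simplified
sketch with illustrative names, not the actual patch):

#include <linux/fs.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/*
 * Freeing the inode immediately lets the next creat() get the same
 * cache-hot object back from the slab; deferring the free with call_rcu()
 * returns it only after a grace period, by which time it is usually cache
 * cold, and the deferred list itself has to be walked.
 */
struct my_inode {
	struct inode	vfs_inode;
	struct rcu_head	i_rcu;
};

static struct kmem_cache *my_inode_cachep;

/* Before: immediate reuse, cache hot. */
static void my_destroy_inode(struct my_inode *inode)
{
	kmem_cache_free(my_inode_cachep, inode);
}

/* After: lock-free (RCU) walkers may still see the inode, so defer the free. */
static void my_inode_rcu_free(struct rcu_head *head)
{
	struct my_inode *inode = container_of(head, struct my_inode, i_rcu);

	kmem_cache_free(my_inode_cachep, inode);
}

static void my_destroy_inode_rcu(struct my_inode *inode)
{
	call_rcu(&inode->i_rcu, my_inode_rcu_free);
}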


> > *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> > vanilla vfs
> > real 0m4.911s 0m0.183s
> > user 0m1.920s 0m1.610s
> > sys 4m58.670s 0m5.770s
> > After the vfs patches, throughput increases 26x; however, parallelism is limited
> > by the test's spawn and exit phases. The sys time improvement is closer to
> > 50x. vanilla is bottlenecked on dcache_lock.
>
> So if we cherry pick patches out of the series, what is the bare
> minimum set needed to obtain a result in this ballpark? Same for the
> other tests?

Well, it's very hard to just scale up bits and pieces, because the
dcache_lock is currently basically global (except for d_flags and
some cases of d_count manipulations).

If we start chipping away at bits and pieces of it as people hit bottlenecks,
I think it will end in a bigger mess than we have now.

I don't think this should be done lightly, but I think it is going to
be required soon.


> > *** Reclaim
> > I have not done much reclaim testing yet. It should be more scalable and lower
> > latency due to a significant reduction in LRU lock interference with other
> > critical sections in inode/dentry code, and because we have per-zone locks.
> > Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> > kswapd will operate on lists of node-local memory objects.
>
> This means we no longer have any global LRUness to inode or dentry
> reclaim, which is going to significantly change caching behaviour.
> It's also got interesting corner cases like a workload running on a
> single node with a dentry/icache working set larger than the VM
> wants to hold on a single node.
>
> We went through these sorts of problems with cpusets a few years
> back, and the workaround for it was not to limit the slab cache to
> the cpuset's nodes. Handling this sort of problem correctly seems
> distinctly non-trivial, so I'm really very reluctant to move in this
> direction without clear evidence that we have no other
> alternative....

As I explained in the other mail, that's not actually how the per-zone
reclaim works.
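
For context, the per-zone LRUs boil down to roughly this shape (a simplified
sketch; the names and the flat array are illustrative, and lock/list
initialization is omitted):

#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>

struct dentry_lru {
	spinlock_t	 lock;
	struct list_head list;
	long		 nr_items;
};

/* One LRU per zone; a real version would more likely live in struct zone. */
static struct dentry_lru dentry_lrus[MAX_NUMNODES][MAX_NR_ZONES];

static struct dentry_lru *dentry_lru_of(struct dentry *dentry)
{
	/* The LRU is chosen by the zone backing the dentry's memory... */
	struct zone *zone = page_zone(virt_to_page(dentry));

	return &dentry_lrus[zone_to_nid(zone)][zone_idx(zone)];
}

static void dentry_lru_add(struct dentry *dentry)
{
	struct dentry_lru *lru = dentry_lru_of(dentry);

	/* ...so reclaim/kswapd for a node only walks node-local objects. */
	spin_lock(&lru->lock);
	list_add(&dentry->d_lru, &lru->list);
	lru->nr_items++;
	spin_unlock(&lru->lock);
}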

Thanks,
Nick
From: Frank Mayhar on
On Wed, 2010-06-30 at 21:30 +1000, Dave Chinner wrote:
> Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
> because there is a demonstrated need...

Well, we've repeatedly run into problems with contention on the
dcache_lock as well as the inode_lock; changes that improve those paths
are extremely interesting to us. I've also seen numbers from systems
with large (i.e. 32 to 64) numbers of cores that clearly show serious
problems in this area.

Further, while this seems like a bunch of patches, a close look shows
that it basically just pushes the dcache and inode locks down as far as
possible, making other improvements (such as the removal of a few atomics
and no longer batching inode reclaims) based on that
work. I would be hard-pressed to find much to cherry-pick from this
patch set.

One interesting thing might be to do a set of performance tests for
kernels with increasingly more of the patchset, just to see the effect
of the earlier patches against a vanilla kernel and to measure the
cumulative effect of the later patches. (I'm not volunteering, however:
ENOTIME.)
--
Frank Mayhar <fmayhar(a)google.com>
Google, Inc.
