From: Dave Chinner on
On Wed, Jul 28, 2010 at 01:09:08AM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > solve. The difficulty (as always) is in reliably reproducing the bad
> > behaviour.
>
> Sure, and I didn't see any corruptions, it seems pretty stable and
> scalability is better than other filesystems. I'll see if I can
> give a better recipe to reproduce the 'livelock'ish behaviour.

Well, stable is a good start :)

> > > >         fs_mark rate (thousands of files/second)
> > > >
> > > >                2.6.35-rc5    2.6.35-rc5-scale
> > > > threads      xfs    ext4       xfs    ext4
> > > >       1       20      39        20      39
> > > >       2       35      55        35      57
> > > >       4       60      41        57      42
> > > >       8       79       9        75       9
> > > >
> > > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > > pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> > > > going to ignore ext4 for the purposes of testing scalability here.
> > > >
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > > slightly lower throughput. So at this class of machine for this
> > > > workload, the changes result in a slight reduction in scalability.
> > >
> > > I wonder if these results are stable. It's possible that changes in
> > > reclaim behaviour are causing my patches to require more IO for a
> > > given unit of work?
> >
> > More likely that's the result of using a smaller log size because it
> > will require more frequent metadata pushes to make space for new
> > transactions.
>
> I was just checking whether your numbers are stable (where you
> saw some slowdown with vfs-scale patches), and what could be the
> cause. I agree that running real disks could make big changes in
> behaviour.

Yeah, the numbers are repeatable to within about +/-5%. I generally
don't bother with optimisations that result in gains or losses
smaller than that, because IO benchmarks that reliably reproduce
results with better repeatability than that are few and far between.

> > FWIW, I use PCP monitoring graphs to correlate behavioural changes
> > across different subsystems because it is far easier to relate
> > information visually than it is by looking at raw numbers or traces.
> > I think this graph shows the effect of reclaim on performance
> > most clearly:
> >
> > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png
>
> I haven't actually used that, it looks interesting.

The archiving side of PCP is the most useful, I find. i.e. being able
to record the metrics into a file and analyse them with pmchart or
other tools after the fact...

> > That is by far the largest improvement I've been able to obtain from
> > modifying the shrinker code, and it is from those sorts of
> > observations that I think that IO being issued from reclaim is
> > currently the most significant performance limiting factor for XFS
> > in this sort of workload....
>
> How is the xfs inode reclaim tied to linux inode reclaim? Does the
> xfs inode not become reclaimable until some time after the linux inode
> is reclaimed? Or what?

The struct xfs_inode embeds a struct inode like so:

struct xfs_inode {
	.....
	struct inode	i_inode;	/* the embedded VFS inode */
};

so they are the same chunk of memory. XFS does not use the VFS inode
hashes for finding inodes - that's what the per-ag radix trees are
used for. The xfs_inode lives longer than the struct inode because
we do non-trivial work after the VFS "reclaims" the struct inode.

For example, when an inode is unlinked we do not truncate or free
the inode until after the VFS has finished with it - the inode
remains on the unlinked list (orphaned in ext3 terms) from the time
it is unlinked by the VFS to the time the last VFS reference goes
away. When XFS gets it back, XFS then issues the inactive
transaction that takes the inode off the unlinked list and marks it
free in the inode alloc btree. This transaction is asynchronous and
dirties the xfs inode. Finally XFS will mark the inode as
reclaimable via a radix tree tag. The final processing of the inode
is then done either by a background reclaim walk from xfssyncd
(every 30s), which does non-blocking operations to finalize reclaim,
or by the memory shrinker. It may take several passes to actually
reclaim the inode. e.g. one pass to force the log if the inode is
pinned, another pass to flush the inode to disk if it is dirty and
not stale, and then another pass to reclaim the inode once clean.
There may be multiple passes in between where the inode is skipped
because those operations have not completed.

And to top it all off, if the inode is looked up again (cache hit)
while in the reclaimable state, it will be removed from the reclaim
state and reused immediately. In this case we don't need to continue
the reclaim processing; other things will ensure all the correct
information will go to disk.

> Do all or most of the xfs inodes require IO before being reclaimed
> during this test?

Yes, because all the inodes are being dirtied and they are being
reclaimed faster than background flushing expires them.

> I wonder if you could throttle them a bit or sort
> them somehow so that they tend to be cleaned by writeout and reclaim
> just comes after and removes the clean ones, like pagecache reclaim
> is (supposed) to work?

The whole point of using the radix trees is to get nicely sorted
reclaim IO - inodes are indexed by number, and the radix tree walk
gives us ascending inode number (and hence ascending block number)
reclaim. The background reclaim then allows optimal flushing to
occur by aggregating all the IO into delayed write metadata buffers
so they can be sorted and flushed to the elevator by the xfsbufd in
the most optimal manner possible.

The shrinker does preempt this somewhat, which is why delaying the
XFS shrinker's work appears to improve things a lot. If the shrinker
is not running, the background reclaim does exactly what you are
suggesting.

However, I don't think the increase in iops is caused by the XFS
inode shrinker - I think it is the VFS cache shrinkers. If you
look at the graphs in the link above, performance doesn't
decrease when the XFS inode cache is being shrunk (top chart, yellow
trace) - it drops when the vfs caches are being shrunk (middle
chart). I haven't correlated the behaviour any further than that
because I haven't had time.

FWIW, all this background reclaim, radix tree reclaim tagging and
walking, embedded struct inodes, etc. is relatively new code.
The oldest bit of it was introduced in 2.6.31 (I think), so a
significant part of what we are exploring here is uncharted
territory. The changes to reclaim, etc. are partially responsible
for the scalability we are getting from delayed logging, but there
is certainly room for improvement....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin on
On Mon, Jul 26, 2010 at 03:41:11PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Pushed several fixes and improvements
> o XFS bugs fixed by Dave
> o dentry and inode stats bugs noticed by Dave
> o vmscan shrinker bugs fixed by KOSAKI san
> o compile bugs noticed by John
> o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
> o scalability improvements for rename_lock

Yet another result on my small 2s8c Opteron. This time, the
re-aim benchmark configured as described here:

http://ertos.nicta.com.au/publications/papers/Chubb_Williams_05.pdf

It is using ext2 on ramdisk and an IO intensive workload, with fsync
activity.

I did 10 runs on each, and took the max jobs/sec of each run.

    N           Min           Max        Median           Avg        Stddev
x  10       2598750       2735122     2665384.6     2653353.8     46421.696
+  10     3337297.3     3484687.5     3410689.7     3397763.8     49994.631
Difference at 95.0% confidence
        744410 +/- 45327.3
        28.0554% +/- 1.7083%

Average is 2653K jobs/s for vanilla versus 3398K jobs/s for
vfs-scale, or a 28% speedup.

The profile is interesting. It is known to be inode_lock intensive, but
we also see here that it is do_lookup intensive, due to cacheline bouncing
in common elements of path lookups.

Vanilla:
# Overhead Symbol
# ........ ......
#
7.63% [k] __d_lookup
|
|--88.59%-- do_lookup
|--9.75%-- __lookup_hash
|--0.89%-- d_lookup

7.17% [k] _raw_spin_lock
|
|--11.07%-- _atomic_dec_and_lock
| |
| |--53.73%-- dput
| --46.27%-- iput
|
|--9.85%-- __mark_inode_dirty
| |
| |--46.25%-- ext2_new_inode
| |--25.32%-- __set_page_dirty
| |--18.27%-- nobh_write_end
| |--6.91%-- ext2_new_blocks
| |--3.12%-- ext2_unlink
|
|--7.69%-- ext2_new_inode
|
|--6.84%-- insert_inode_locked
| ext2_new_inode
|
|--6.56%-- new_inode
| ext2_new_inode
|
|--5.61%-- writeback_single_inode
| sync_inode
| generic_file_fsync
| ext2_fsync
|
|--5.13%-- dput
|--3.75%-- generic_delete_inode
|--3.56%-- __d_lookup
|--3.53%-- ext2_free_inode
|--3.40%-- sync_inode
|--2.71%-- d_instantiate
|--2.36%-- d_delete
|--2.25%-- inode_sub_bytes
|--1.84%-- file_move
|--1.52%-- file_kill
|--1.36%-- ext2_new_blocks
|--1.34%-- ext2_create
|--1.34%-- d_alloc
|--1.11%-- do_lookup
|--1.07%-- iput
|--1.05%-- __d_instantiate

4.19% [k] mutex_spin_on_owner
|
|--99.92%-- __mutex_lock_slowpath
| mutex_lock
| |
| |--56.45%-- do_unlinkat
| | sys_unlink
| |
| --43.55%-- do_last
| do_filp_open

2.96% [k] _atomic_dec_and_lock
|
|--58.18%-- dput
|--31.02%-- mntput_no_expire
|--3.30%-- path_put
|--3.09%-- iput
|--2.69%-- link_path_walk
|--1.02%-- fput

2.73% [k] copy_user_generic_string
2.67% [k] __mark_inode_dirty
2.65% [k] link_path_walk
2.63% [k] mark_buffer_dirty
1.72% [k] __memcpy
1.62% [k] generic_getxattr
1.50% [k] acl_permission_check
1.30% [k] __find_get_block
1.30% [k] __memset
1.17% [k] ext2_find_entry
1.09% [k] ext2_new_inode
1.06% [k] system_call
1.01% [k] kmem_cache_free
1.00% [k] dput


In vfs-scale, most of the spinlock contention and path lookup cost is
gone. Contention for parent i_mutex (and d_lock) for creat/unlink
operations is now at the top of the profile.

A lot of the spinlock overhead seems to be not contention so much as
the cost of the atomics. Down at 3% it is much less of a problem
than it was, though.

We may run into a bit of contention on the per-bdi inode dirty/io
list lock, with just a single ramdisk device (dirty/fsync activity
will hit this lock), but it is really not worth worrying about at
the moment.

# Overhead Symbol
# ........ ......
#
5.67% [k] mutex_spin_on_owner
|
|--99.96%-- __mutex_lock_slowpath
| mutex_lock
| |
| |--58.63%-- do_unlinkat
| | sys_unlink
| |
| --41.37%-- do_last
| do_filp_open

3.93% [k] __mark_inode_dirty
3.43% [k] copy_user_generic_string
3.31% [k] link_path_walk
3.15% [k] mark_buffer_dirty
3.11% [k] _raw_spin_lock
|
|--11.03%-- __mark_inode_dirty
|--10.54%-- ext2_new_inode
|--7.60%-- ext2_free_inode
|--6.33%-- inode_sub_bytes
|--6.27%-- ext2_new_blocks
|--5.80%-- generic_delete_inode
|--4.09%-- ext2_create
|--3.62%-- writeback_single_inode
|--2.92%-- sync_inode
|--2.81%-- generic_drop_inode
|--2.46%-- iput
|--1.86%-- dput
|--1.80%-- __dquot_alloc_space
|--1.61%-- __mutex_unlock_slowpath
|--1.59%-- generic_file_fsync
|--1.57%-- __d_instantiate
|--1.55%-- __set_page_dirty_buffers
|--1.36%-- d_alloc_and_lookup
|--1.23%-- do_path_lookup
|--1.10%-- ext2_free_blocks

2.13% [k] __memset
2.12% [k] __memcpy
1.98% [k] __d_lookup_rcu
1.46% [k] generic_getxattr
1.44% [k] ext2_find_entry
1.41% [k] __find_get_block
1.27% [k] kmem_cache_free
1.25% [k] ext2_new_inode
1.23% [k] system_call
1.02% [k] ext2_add_link
1.01% [k] strncpy_from_user
0.96% [k] kmem_cache_alloc
0.95% [k] find_get_page
0.94% [k] sysret_check
0.88% [k] __d_lookup
0.75% [k] ext2_delete_entry
0.70% [k] generic_file_aio_read
0.67% [k] generic_file_buffered_write
0.63% [k] ext2_new_blocks
0.62% [k] __percpu_counter_add
0.59% [k] __bread
0.58% [k] __wake_up_bit
0.58% [k] __mutex_lock_slowpath
0.56% [k] __ext2_write_inode
0.55% [k] ext2_get_blocks
From: Nick Piggin on
On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working
>
> The really interesting new item is the store-free path walk, (43fe2b)
> which I've re-introduced. It has had a complete redesign, it has much
> better performance and scalability in more cases, and is actually sane
> code now.

Things are progressing well here with fixes and improvements to the
branch.

One thing that has been brought to my attention is that store-free path
walking (rcu-walk) drops into the normal refcounted walking on any
filesystem that has posix ACLs enabled.

Having misread the code (IS_POSIXACL is based on a superblock
flag), I had thought we only dropped out of rcu-walk on
encountering an inode that actually has ACLs.

This is quite an important point for any performance testing work.
ACLs can actually be rcu checked quite easily in most cases, but it
takes a bit of work on APIs.

Filesystems defining their own ->permission and ->d_revalidate will
also not use rcu-walk. These could likewise be made to support rcu-walk
more widely, but it will require knowledge of rcu-walk to be pushed
into filesystems.

It's not a big deal, basically: no blocking, no stores, no referencing
non-rcu-protected data, and confirm with seqlock. That is usually the
case in fastpaths. If it cannot be satisfied, then just return -ECHILD
and you'll get called in the usual ref-walk mode next time.

But for now, keep this in mind if you plan to do any serious performance
testing work, *do not mount filesystems with ACL support*.

Thanks,
Nick

From: john stultz on
On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
> >
> > The really interesting new item is the store-free path walk, (43fe2b)
> > which I've re-introduced. It has had a complete redesign, it has much
> > better performance and scalability in more cases, and is actually sane
> > code now.
>
> Things are progressing well here with fixes and improvements to the
> branch.

Hey Nick,
Just another minor compile issue with today's vfs-scale-working branch.

fs/fuse/dir.c:231: error: 'fuse_dentry_revalidate_rcu' undeclared here
(not in a function)

From looking at the vfat and ecryptfs changes in
582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
add the following?


diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f0c2479..9ee4c10 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -154,7 +154,7 @@ u64 fuse_get_attr_version(struct fuse_conn *fc)
* the lookup once more. If the lookup results in the same inode,
* then refresh the attributes, timeouts and mark the dentry valid.
*/
-static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
+static int fuse_dentry_revalidate_rcu(struct dentry *entry, struct nameidata *nd)
{
struct inode *inode = entry->d_inode;



From: Nick Piggin on
On Mon, Aug 02, 2010 at 05:27:59PM -0700, John Stultz wrote:
> On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> > >
> > > The really interesting new item is the store-free path walk, (43fe2b)
> > > which I've re-introduced. It has had a complete redesign, it has much
> > > better performance and scalability in more cases, and is actually sane
> > > code now.
> >
> > Things are progressing well here with fixes and improvements to the
> > branch.
>
> Hey Nick,
> Just another minor compile issue with today's vfs-scale-working branch.
>
> fs/fuse/dir.c:231: error: 'fuse_dentry_revalidate_rcu' undeclared here
> (not in a function)
>
> From looking at the vfat and ecryptfs changes in
> 582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
> add the following?

Thanks John, you're right.

I thought I actually linked and ran this, but I must not have had fuse
compiled in.
