From: Dave Chinner on 28 Jul 2010 01:10

On Wed, Jul 28, 2010 at 01:09:08AM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > solve. The difficulty (as always) is in reliably reproducing the bad
> > behaviour.
>
> Sure, and I didn't see any corruptions, it seems pretty stable and
> scalability is better than other filesystems. I'll see if I can
> give a better recipe to reproduce the 'livelock'ish behaviour.

Well, stable is a good start :)

> > > > fs_mark rate (thousands of files/second)
> > > >            2.6.35-rc5      2.6.35-rc5-scale
> > > > threads    xfs    ext4     xfs    ext4
> > > >    1        20     39       20     39
> > > >    2        35     55       35     57
> > > >    4        60     41       57     42
> > > >    8        79      9       75      9
> > > >
> > > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > > > going to ignore ext4 for the purposes of testing scalability here.
> > > >
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > > slightly lower throughput. So at this class of machine for this
> > > > workload, the changes result in a slight reduction in scalability.
> > >
> > > I wonder if these results are stable. It's possible that changes in
> > > reclaim behaviour are causing my patches to require more IO for a
> > > given unit of work?
> >
> > More likely that's the result of using a smaller log size because it
> > will require more frequent metadata pushes to make space for new
> > transactions.
>
> I was just checking whether your numbers are stable (where you
> saw some slowdown with vfs-scale patches), and what could be the
> cause. I agree that running real disks could make big changes in
> behaviour.

Yeah, the numbers are repeatable within about +/-5%.
I generally don't bother with optimisations that result in gains/losses
less than that because IO benchmarks that reliably reproduce results
with more precise repeatability than that are few and far between.

> > FWIW, I use PCP monitoring graphs to correlate behavioural changes
> > across different subsystems because it is far easier to relate
> > information visually than it is by looking at raw numbers or traces.
> > I think this graph shows the effect of reclaim on performance
> > most clearly:
> >
> > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png
>
> I haven't actually used that, it looks interesting.

The archiving side of PCP is the most useful, I find. i.e. being able
to record the metrics into a file and analyse them with pmchart or
other tools after the fact...

> > That is by far the largest improvement I've been able to obtain from
> > modifying the shrinker code, and it is from those sorts of
> > observations that I think that IO being issued from reclaim is
> > currently the most significant performance limiting factor for XFS
> > in this sort of workload....
>
> How is the xfs inode reclaim tied to linux inode reclaim? Does the
> xfs inode not become reclaimable until some time after the linux inode
> is reclaimed? Or what?

The struct xfs_inode embeds a struct inode like so:

struct xfs_inode {
	.....
	struct inode	i_inode;
}

so they are the same chunk of memory. XFS does not use the VFS inode
hashes for finding inodes - that's what the per-ag radix trees are used
for. The xfs_inode lives longer than the struct inode because we do
non-trivial work after the VFS "reclaims" the struct inode.

For example, when an inode is unlinked we do not truncate or free the
inode until after the VFS has finished with it - the inode remains on
the unlinked list (orphaned in ext3 terms) from the time it is unlinked
by the VFS to the time the last VFS reference goes away.
When XFS gets it, XFS then issues the inactive transaction that takes
the inode off the unlinked list and marks it free in the inode alloc
btree. This transaction is asynchronous and dirties the xfs inode.
Finally XFS will mark the inode as reclaimable via a radix tree tag.
The final processing of the inode is then done via a background reclaim
walk from xfssyncd (every 30s) where it will do non-blocking operations
to finalize reclaim. It may take several passes to actually reclaim the
inode. e.g. one pass to force the log if the inode is pinned, another
pass to flush the inode to disk if it is dirty and not stale, and then
another pass to reclaim the inode once clean. There may be multiple
passes in between where the inode is skipped because those operations
have not completed.

And to top it all off, if the inode is looked up again (cache hit)
while in the reclaimable state, it will be removed from the reclaim
state and reused immediately. In this case we don't need to continue
the reclaim processing; other things will ensure all the correct
information will go to disk.

> Do all or most of the xfs inodes require IO before being reclaimed
> during this test?

Yes, because all the inodes are being dirtied and they are being
reclaimed faster than background flushing expires them.

> I wonder if you could throttle them a bit or sort
> them somehow so that they tend to be cleaned by writeout and reclaim
> just comes after and removes the clean ones, like pagecache reclaim
> is (supposed) to work.?

The whole point of using the radix trees is to get nicely sorted
reclaim IO - inodes are indexed by number, and the radix tree walk gives
us ascending inode number (and hence ascending block number) reclaim -
and the background reclaim allows optimal flushing to occur by
aggregating all the IO into delayed write metadata buffers so they can
be sorted and flushed to the elevator by the xfsbufd in the most optimal
manner possible.
The shrinker does preempt this somewhat, which is why delaying the XFS
shrinker's work appears to improve things a lot. If the shrinker is not
running, then the background reclaim does exactly what you are
suggesting.

However, I don't think the increase in iops is caused by the XFS inode
shrinker - I think that it is the VFS cache shrinkers. If you look at
the graphs in the link above, performance doesn't decrease when the XFS
inode cache is being shrunk (top chart, yellow trace) - it drops when
the vfs caches are being shrunk (middle chart). I haven't correlated the
behaviour any further than that because I haven't had time.

FWIW, all this background reclaim, radix tree reclaim tagging and
walking, embedded struct inodes, etc is all relatively new code. The
oldest bit of it was introduced in 2.6.31 (I think) and so a
significant part of what we are exploring here is uncharted territory.
The changes to reclaim, etc are partially responsible for the
scalability we are getting from delayed logging, but there is certainly
room for improvement....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
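
As a reading aid for Dave's description, here is a minimal userspace sketch of two points from the mail: the xfs_inode and its embedded struct inode being one chunk of memory, and reclaim needing several non-blocking passes. This is not the XFS code. The field names (pinned, dirty, stale), the helper names, and the assumption that a log force or flush has completed by the next pass are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; all fields are hypothetical. */
struct inode {
	unsigned long i_ino;
};

struct xfs_inode {
	int pinned;           /* still pinned in the log */
	int dirty;            /* needs writing back to disk */
	int stale;            /* contents no longer need writing */
	struct inode i_inode; /* embedded VFS inode: same chunk of memory */
};

/* Recover the containing xfs_inode from its embedded struct inode. */
static struct xfs_inode *to_xfs_inode(struct inode *inode)
{
	return (struct xfs_inode *)((char *)inode -
				    offsetof(struct xfs_inode, i_inode));
}

/*
 * One non-blocking reclaim pass. Returns 1 once the inode can be freed,
 * 0 if it has to be revisited on a later pass. In this toy the log force
 * and the flush are assumed to have completed by the next pass.
 */
static int reclaim_pass(struct xfs_inode *ip)
{
	if (ip->pinned) {
		ip->pinned = 0; /* this pass only forces the log */
		return 0;
	}
	if (ip->dirty && !ip->stale) {
		ip->dirty = 0;  /* this pass only flushes the inode */
		return 0;
	}
	return 1;               /* clean (or stale): reclaim it now */
}

/* Count how many passes an inode needs before it is reclaimed. */
static int passes_until_reclaimed(struct xfs_inode *ip)
{
	int passes = 1;

	while (!reclaim_pass(ip))
		passes++;
	return passes;
}
```

In this sketch a pinned, dirty inode is only reclaimed on a later pass than a clean one, which is the behaviour the mail describes. In the real code, passes where the IO has not yet completed would add further skips.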
From: Nick Piggin on 28 Jul 2010 06:30

On Mon, Jul 26, 2010 at 03:41:11PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Pushed several fixes and improvements
> o XFS bugs fixed by Dave
> o dentry and inode stats bugs noticed by Dave
> o vmscan shrinker bugs fixed by KOSAKI san
> o compile bugs noticed by John
> o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
> o scalability improvements for rename_lock

Yet another result on my small 2s8c Opteron. This time, the re-aim
benchmark configured as described here:

http://ertos.nicta.com.au/publications/papers/Chubb_Williams_05.pdf

It is using ext2 on ramdisk and an IO intensive workload, with fsync
activity.

I did 10 runs on each, and took the max jobs/sec of each run.

    N           Min           Max        Median           Avg        Stddev
x  10       2598750       2735122     2665384.6     2653353.8     46421.696
+  10     3337297.3     3484687.5     3410689.7     3397763.8     49994.631
Difference at 95.0% confidence
        744410 +/- 45327.3
        28.0554% +/- 1.7083%

Average is 2653K jobs/s for vanilla, versus 3398K jobs/s for vfs-scale,
or a 28% speedup.

The profile is interesting. It is known to be inode_lock intensive, but
we also see here that it is do_lookup intensive, due to cacheline bouncing
in common elements of path lookups.

Vanilla:
# Overhead  Symbol
# ........  ......
#
     7.63%  [k] __d_lookup
            |
            |--88.59%-- do_lookup
            |--9.75%-- __lookup_hash
            |--0.89%-- d_lookup

     7.17%  [k] _raw_spin_lock
            |
            |--11.07%-- _atomic_dec_and_lock
            |          |
            |          |--53.73%-- dput
            |           --46.27%-- iput
            |
            |--9.85%-- __mark_inode_dirty
            |          |
            |          |--46.25%-- ext2_new_inode
            |          |--25.32%-- __set_page_dirty
            |          |--18.27%-- nobh_write_end
            |          |--6.91%-- ext2_new_blocks
            |          |--3.12%-- ext2_unlink
            |
            |--7.69%-- ext2_new_inode
            |
            |--6.84%-- insert_inode_locked
            |          ext2_new_inode
            |
            |--6.56%-- new_inode
            |          ext2_new_inode
            |
            |--5.61%-- writeback_single_inode
            |          sync_inode
            |          generic_file_fsync
            |          ext2_fsync
            |
            |--5.13%-- dput
            |--3.75%-- generic_delete_inode
            |--3.56%-- __d_lookup
            |--3.53%-- ext2_free_inode
            |--3.40%-- sync_inode
            |--2.71%-- d_instantiate
            |--2.36%-- d_delete
            |--2.25%-- inode_sub_bytes
            |--1.84%-- file_move
            |--1.52%-- file_kill
            |--1.36%-- ext2_new_blocks
            |--1.34%-- ext2_create
            |--1.34%-- d_alloc
            |--1.11%-- do_lookup
            |--1.07%-- iput
            |--1.05%-- __d_instantiate

     4.19%  [k] mutex_spin_on_owner
            |
            |--99.92%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |
            |          |--56.45%-- do_unlinkat
            |          |          sys_unlink
            |          |
            |           --43.55%-- do_last
            |                     do_filp_open

     2.96%  [k] _atomic_dec_and_lock
            |
            |--58.18%-- dput
            |--31.02%-- mntput_no_expire
            |--3.30%-- path_put
            |--3.09%-- iput
            |--2.69%-- link_path_walk
            |--1.02%-- fput

     2.73%  [k] copy_user_generic_string
     2.67%  [k] __mark_inode_dirty
     2.65%  [k] link_path_walk
     2.63%  [k] mark_buffer_dirty
     1.72%  [k] __memcpy
     1.62%  [k] generic_getxattr
     1.50%  [k] acl_permission_check
     1.30%  [k] __find_get_block
     1.30%  [k] __memset
     1.17%  [k] ext2_find_entry
     1.09%  [k] ext2_new_inode
     1.06%  [k] system_call
     1.01%  [k] kmem_cache_free
     1.00%  [k] dput

In vfs-scale, most of the spinlock contention and path lookup cost is
gone. Contention for parent i_mutex (and d_lock) for creat/unlink
operations is now at the top of the profile.

A lot of the spinlock overhead seems to be not contention so much as
the cost of the atomics. Down at 3% it is much less a problem than
it was though.
We may run into a bit of contention on the per-bdi inode dirty/io
list lock, with just a single ramdisk device (dirty/fsync activity
will hit this lock), but it is really not worth worrying about at
the moment.

# Overhead  Symbol
# ........  ......
#
     5.67%  [k] mutex_spin_on_owner
            |
            |--99.96%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |
            |          |--58.63%-- do_unlinkat
            |          |          sys_unlink
            |          |
            |           --41.37%-- do_last
            |                     do_filp_open

     3.93%  [k] __mark_inode_dirty
     3.43%  [k] copy_user_generic_string
     3.31%  [k] link_path_walk
     3.15%  [k] mark_buffer_dirty
     3.11%  [k] _raw_spin_lock
            |
            |--11.03%-- __mark_inode_dirty
            |--10.54%-- ext2_new_inode
            |--7.60%-- ext2_free_inode
            |--6.33%-- inode_sub_bytes
            |--6.27%-- ext2_new_blocks
            |--5.80%-- generic_delete_inode
            |--4.09%-- ext2_create
            |--3.62%-- writeback_single_inode
            |--2.92%-- sync_inode
            |--2.81%-- generic_drop_inode
            |--2.46%-- iput
            |--1.86%-- dput
            |--1.80%-- __dquot_alloc_space
            |--1.61%-- __mutex_unlock_slowpath
            |--1.59%-- generic_file_fsync
            |--1.57%-- __d_instantiate
            |--1.55%-- __set_page_dirty_buffers
            |--1.36%-- d_alloc_and_lookup
            |--1.23%-- do_path_lookup
            |--1.10%-- ext2_free_blocks

     2.13%  [k] __memset
     2.12%  [k] __memcpy
     1.98%  [k] __d_lookup_rcu
     1.46%  [k] generic_getxattr
     1.44%  [k] ext2_find_entry
     1.41%  [k] __find_get_block
     1.27%  [k] kmem_cache_free
     1.25%  [k] ext2_new_inode
     1.23%  [k] system_call
     1.02%  [k] ext2_add_link
     1.01%  [k] strncpy_from_user
     0.96%  [k] kmem_cache_alloc
     0.95%  [k] find_get_page
     0.94%  [k] sysret_check
     0.88%  [k] __d_lookup
     0.75%  [k] ext2_delete_entry
     0.70%  [k] generic_file_aio_read
     0.67%  [k] generic_file_buffered_write
     0.63%  [k] ext2_new_blocks
     0.62%  [k] __percpu_counter_add
     0.59%  [k] __bread
     0.58%  [k] __wake_up_bit
     0.58%  [k] __mutex_lock_slowpath
     0.56%  [k] __ext2_write_inode
     0.55%  [k] ext2_get_blocks
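
For readers who want to check the "Difference at 95.0% confidence" figure in Nick's mail, the sketch below recomputes the interval half-width from the two summary rows. It assumes a pooled-variance Student's t interval, which the table layout suggests but the mail does not state. The struct and function names are made up for this note, and the critical value 2.101 (two-sided, 18 degrees of freedom) is supplied by the caller rather than computed.

```c
#include <assert.h>

/* Summary statistics for one set of runs (one row of the table). */
struct sample {
	int n;
	double avg;
	double stddev;
};

/* Newton's method square root, used only to keep the sketch free of libm. */
static double sqrt_newton(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	if (x <= 0.0)
		return 0.0;
	for (i = 0; i < 100; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/*
 * Half-width of the confidence interval on (b.avg - a.avg), using a
 * pooled standard deviation. t_crit is the two-sided Student's t value
 * for a.n + b.n - 2 degrees of freedom (2.101 for 18 at 95%).
 */
static double diff_half_width(struct sample a, struct sample b, double t_crit)
{
	double pooled_var = ((a.n - 1) * a.stddev * a.stddev +
			     (b.n - 1) * b.stddev * b.stddev) /
			    (a.n + b.n - 2);

	return t_crit * sqrt_newton(pooled_var * (1.0 / a.n + 1.0 / b.n));
}
```

Fed the vanilla row (10, 2653353.8, 46421.696) and the vfs-scale row (10, 3397763.8, 49994.631), the difference of the averages is 744410 and the half-width comes out close to the 45327.3 quoted in the mail. That is consistent with the assumption above but does not prove which tool produced the table.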
From: Nick Piggin on 30 Jul 2010 05:20

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working
>
> The really interesting new item is the store-free path walk, (43fe2b)
> which I've re-introduced. It has had a complete redesign, it has much
> better performance and scalability in more cases, and is actually sane
> code now.

Things are progressing well here with fixes and improvements to the
branch.

One thing that has been brought to my attention is that store-free path
walking (rcu-walk) drops into the normal refcounted walking on any
filesystem that has posix ACLs enabled. I had missed that IS_POSIXACL
is based on a superblock flag, and had thought we only drop out of
rcu-walk in case of encountering an inode that actually has acls.

This is quite an important point for any performance testing work.
ACLs can actually be rcu checked quite easily in most cases, but it
takes a bit of work on APIs.

Filesystems defining their own ->permission and ->d_revalidate will
also not use rcu-walk. These could likewise be made to support rcu-walk
more widely, but it will require knowledge of rcu-walk to be pushed
into filesystems.

It's not a big deal, basically: no blocking, no stores, no referencing
non-rcu-protected data, and confirm with seqlock. That is usually the
case in fastpaths. If it cannot be satisfied, then just return -ECHILD
and you'll get called in the usual ref-walk mode next time.

But for now, keep this in mind if you plan to do any serious performance
testing work: *do not mount filesystems with ACL support*.

Thanks,
Nick
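
The -ECHILD convention Nick describes can be illustrated with a small userspace mock. Nothing here is the kernel API. The enum, the toy inode and both functions are hypothetical names chosen for this note, and the only thing taken from the mail is the rule "if you cannot answer under rcu-walk, return -ECHILD and get called again in ref-walk mode".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical walk modes; the names are illustrative only. */
enum walk_mode { WALK_RCU, WALK_REF };

/* Toy inode: has_acl marks a check that cannot be done under rcu-walk. */
struct toy_inode {
	int has_acl;
	int allowed;
};

/*
 * Hypothetical permission hook. Under rcu-walk it must not block or
 * store, so when it cannot decide it bails out with -ECHILD.
 */
static int toy_permission(const struct toy_inode *inode, enum walk_mode mode)
{
	if (mode == WALK_RCU && inode->has_acl)
		return -ECHILD;
	return inode->allowed ? 0 : -EACCES;
}

/*
 * Hypothetical caller: try rcu-walk first and repeat the check in
 * ref-walk mode only if the hook asked for it.
 */
static int toy_lookup(const struct toy_inode *inode, int *used_ref_walk)
{
	int err = toy_permission(inode, WALK_RCU);

	*used_ref_walk = 0;
	if (err == -ECHILD) {
		*used_ref_walk = 1;
		err = toy_permission(inode, WALK_REF);
	}
	return err;
}
```

An inode without ACLs is answered entirely in the rcu-walk attempt, while one with ACLs costs a second, refcounted attempt. That second attempt is the overhead the mail warns about when benchmarking on ACL-enabled mounts.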

From: john stultz on 2 Aug 2010 20:30

On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
> >
> > The really interesting new item is the store-free path walk, (43fe2b)
> > which I've re-introduced. It has had a complete redesign, it has much
> > better performance and scalability in more cases, and is actually sane
> > code now.
>
> Things are progressing well here with fixes and improvements to the
> branch.

Hey Nick,
	Just another minor compile issue with today's vfs-scale-working branch.

fs/fuse/dir.c:231: error: 'fuse_dentry_revalidate_rcu' undeclared here
(not in a function)

From looking at the vfat and ecryptfs changes in
582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
add the following?

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f0c2479..9ee4c10 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -154,7 +154,7 @@ u64 fuse_get_attr_version(struct fuse_conn *fc)
  * the lookup once more. If the lookup results in the same inode,
  * then refresh the attributes, timeouts and mark the dentry valid.
  */
-static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
+static int fuse_dentry_revalidate_rcu(struct dentry *entry, struct nameidata *nd)
 {
 	struct inode *inode = entry->d_inode;

From: Nick Piggin on 3 Aug 2010 01:50

On Mon, Aug 02, 2010 at 05:27:59PM -0700, John Stultz wrote:
> On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> > >
> > > The really interesting new item is the store-free path walk, (43fe2b)
> > > which I've re-introduced. It has had a complete redesign, it has much
> > > better performance and scalability in more cases, and is actually sane
> > > code now.
> >
> > Things are progressing well here with fixes and improvements to the
> > branch.
>
> Hey Nick,
> Just another minor compile issue with today's vfs-scale-working branch.
>
> fs/fuse/dir.c:231: error: 'fuse_dentry_revalidate_rcu' undeclared here
> (not in a function)
>
> From looking at the vfat and ecryptfs changes in
> 582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
> add the following?

Thanks John, you're right.

I thought I actually linked and ran this, but I must not have had fuse
compiled in.