From: Nick Piggin on
On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> I've got a couple of patches needed to build XFS - the shrinker
> merge left some bad fragments - I'll post them in a minute. This

OK cool.


> email is for the longest ever lockdep warning I've seen that
> occurred on boot.

Ah, thanks. OK, that was one of my attempts to keep sockets from
hitting the vfs as much as possible (lazy inode number evaluation).
Not a big problem, but I'll drop the patch for now.

I have just got one for you too, btw :) (on a vanilla kernel, but it is
messing up my lockdep stress testing on xfs). Real or a false positive?

[ INFO: possible circular locking dependency detected ]
2.6.35-rc5-00064-ga9f7f2e #334
-------------------------------------------------------
kswapd0/605 is trying to acquire lock:
(&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
xfs_ilock+0x7c/0xa0

but task is already holding lock:
(&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
xfs_reclaim_inode_shrink+0xc6/0x140

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&xfs_mount_list_lock){++++.-}:
[<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
[<ffffffff815aa646>] _raw_spin_lock+0x36/0x50
[<ffffffff810fabf3>] try_to_free_buffers+0x43/0xb0
[<ffffffff812763b2>] xfs_vm_releasepage+0x92/0xe0
[<ffffffff810908ee>] try_to_release_page+0x2e/0x50
[<ffffffff8109ef56>] shrink_page_list+0x486/0x5a0
[<ffffffff8109f35d>] shrink_inactive_list+0x2ed/0x700
[<ffffffff8109fda0>] shrink_zone+0x3b0/0x460
[<ffffffff810a0f41>] try_to_free_pages+0x241/0x3a0
[<ffffffff810999e2>] __alloc_pages_nodemask+0x4c2/0x6b0
[<ffffffff810c52c6>] alloc_pages_current+0x76/0xf0
[<ffffffff8109205b>] __page_cache_alloc+0xb/0x10
[<ffffffff81092a2a>] find_or_create_page+0x4a/0xa0
[<ffffffff812780cc>] _xfs_buf_lookup_pages+0x14c/0x360
[<ffffffff81279122>] xfs_buf_get+0x72/0x160
[<ffffffff8126eb68>] xfs_trans_get_buf+0xc8/0xf0
[<ffffffff8124439f>] xfs_da_do_buf+0x3df/0x6d0
[<ffffffff81244825>] xfs_da_get_buf+0x25/0x30
[<ffffffff8124a076>] xfs_dir2_data_init+0x46/0xe0
[<ffffffff81247f89>] xfs_dir2_sf_to_block+0xb9/0x5a0
[<ffffffff812501c8>] xfs_dir2_sf_addname+0x418/0x5c0
[<ffffffff81247d7c>] xfs_dir_createname+0x14c/0x1a0
[<ffffffff81271d49>] xfs_create+0x449/0x5d0
[<ffffffff8127d802>] xfs_vn_mknod+0xa2/0x1b0
[<ffffffff8127d92b>] xfs_vn_create+0xb/0x10
[<ffffffff810ddc81>] vfs_create+0x81/0xd0
[<ffffffff810df1a5>] do_last+0x535/0x690
[<ffffffff810e11fd>] do_filp_open+0x21d/0x660
[<ffffffff810d16b4>] do_sys_open+0x64/0x140
[<ffffffff810d17bb>] sys_open+0x1b/0x20
[<ffffffff810023eb>] system_call_fastpath+0x16/0x1b

-> #0 (&(&ip->i_lock)->mr_lock){++++--}:
[<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
[<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
[<ffffffff8105dfba>] down_write_nested+0x4a/0x70
[<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
[<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
[<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
[<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
[<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
[<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
[<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
[<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
[<ffffffff810592ae>] kthread+0x8e/0xa0
[<ffffffff81003194>] kernel_thread_helper+0x4/0x10

other info that might help us debug this:

2 locks held by kswapd0/605:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff8109fe88>]
shrink_slab+0x38/0x190
#1: (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
xfs_reclaim_inode_shrink+0xc6/0x140

stack backtrace:
Pid: 605, comm: kswapd0 Not tainted 2.6.35-rc5-00064-ga9f7f2e #334
Call Trace:
[<ffffffff8106c5d9>] print_circular_bug+0xe9/0xf0
[<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
[<ffffffff8106e3c2>] ? __lock_acquire+0x1092/0x1c10
[<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
[<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
[<ffffffff8105dfba>] down_write_nested+0x4a/0x70
[<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
[<ffffffff815ae795>] ? sub_preempt_count+0x95/0xd0
[<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
[<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
[<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
[<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
[<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
[<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
[<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
[<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
[<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
[<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
[<ffffffff81059700>] ? autoremove_wake_function+0x0/0x40
[<ffffffff815aaf3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70
[<ffffffff810a0a60>] ? kswapd+0x0/0x2a0
[<ffffffff810592ae>] kthread+0x8e/0xa0
[<ffffffff81003194>] kernel_thread_helper+0x4/0x10
[<ffffffff815ab400>] ? restore_args+0x0/0x30
[<ffffffff81059220>] ? kthread+0x0/0xa0

From: Nick Piggin on
On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> Bugs I've noticed so far:
>
> - Using XFS, the existing vfs inode count statistic does not decrease
> as inodes are freed.
> - the existing vfs dentry count remains at zero
> - the existing vfs free inode count remains at zero
>
> $ pminfo -f vfs.inodes vfs.dentry
>
> vfs.inodes.count
> value 7472612
>
> vfs.inodes.free
> value 0
>
> vfs.dentry.count
> value 0
>
> vfs.dentry.free
> value 0

Hm, I must have broken it along the way and not noticed. Thanks
for pointing that out.


> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
>
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> The numbers can't be directly compared, but the test and the setup
> is the same. The XFS numbers below are with delayed logging
> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
>
> fs_mark rate (thousands of files/second)
> 2.6.35-rc5 2.6.35-rc5-scale
> threads xfs ext4 xfs ext4
> 1 20 39 20 39
> 2 35 55 35 57
> 4 60 41 57 42
> 8 79 9 75 9
>
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> going to ignore ext4 for the purposes of testing scalability here.
>
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput. So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

That's a good test case, thanks. I'll see if I can find where
this is coming from. I suspect the RCU-freed inodes. Hm, I may
have to make them DESTROY_BY_RCU after all.
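
Roughly what I have in mind, as an illustrative sketch only (the names are
made up and this is not code from my tree):

/*
 * Illustrative sketch: create the inode slab with SLAB_DESTROY_BY_RCU so
 * the backing pages are only returned to the page allocator after an RCU
 * grace period, even though individual objects may be reused immediately.
 */
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/slab.h>

static struct kmem_cache *example_inode_cachep;

static int __init example_icache_init(void)
{
        example_inode_cachep = kmem_cache_create("example_inode_cache",
                                                 sizeof(struct inode), 0,
                                                 SLAB_HWCACHE_ALIGN |
                                                 SLAB_DESTROY_BY_RCU,
                                                 NULL);
        if (!example_inode_cachep)
                return -ENOMEM;
        return 0;
}

/*
 * The catch: an rcu_read_lock() walker can find an object that has already
 * been freed and reused for a different inode, so any lockless lookup must
 * take the per-inode lock and recheck identity (e.g. i_sb and i_ino) before
 * trusting what it found.
 */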

Thanks,
Nick

From: Dave Chinner on
On Sat, Jul 24, 2010 at 01:51:18AM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > I've got a couple of patches needed to build XFS - the shrinker
> > merge left some bad fragments - I'll post them in a minute. This
>
> OK cool.
>
>
> > email is for the longest ever lockdep warning I've seen that
> > occurred on boot.
>
> Ah, thanks. OK, that was one of my attempts to keep sockets from
> hitting the vfs as much as possible (lazy inode number evaluation).
> Not a big problem, but I'll drop the patch for now.
>
> I have just got one for you too, btw :) (on a vanilla kernel, but it is
> messing up my lockdep stress testing on xfs). Real or a false positive?
>
> [ INFO: possible circular locking dependency detected ]
> 2.6.35-rc5-00064-ga9f7f2e #334
> -------------------------------------------------------
> kswapd0/605 is trying to acquire lock:
> (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
> xfs_ilock+0x7c/0xa0
>
> but task is already holding lock:
> (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
> xfs_reclaim_inode_shrink+0xc6/0x140

False positive, but the xfs_mount_list_lock is gone in 2.6.35-rc6 -
the shrinker context change has fixed that - so you can ignore it
anyway.

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: KOSAKI Motohiro on
> At this point, I would be very interested in reviewing, correctness
> testing on different configurations, and of course benchmarking.

I haven't reviewed this series in a long time, but I've found one mysterious
shrink_slab() usage. Can you please take a look at my patch? (I will send it
as another mail.)


From: KOSAKI Motohiro on
> > At this point, I would be very interested in reviewing, correctness
> > testing on different configurations, and of course benchmarking.
>
> I haven't reviewed this series in a long time, but I've found one mysterious
> shrink_slab() usage. Can you please take a look at my patch? (I will send it
> as another mail.)

Plus, I have one question. The upstream shrink_slab() calculation and your
calculation differ by more than your patch description explains.

upstream:

shrink_slab()

    basic_scan_objects = 4 * (lru_scanned / lru_pages)
                           * (max_pass / shrinker->seeks)    (seeks default: 2)

    scan_objects = min(basic_scan_objects, max_pass * 2)

shrink_icache_memory()

    max_pass = inodes_stat.nr_unused * sysctl_vfs_cache_pressure / 100


That is, a higher sysctl_vfs_cache_pressure makes slab reclaim more aggressive.


On the other hand, your code:

shrinker_add_scan()

    scan_objects = 4 * (scanned / total) * (objects / ratio)
                     * SHRINK_FACTOR * SHRINK_FACTOR

shrink_icache_memory()

    ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

That is, a higher sysctl_vfs_cache_pressure makes slab reclaim less aggressive.


So, I guess the following change correctly reflects your original intention.

The new calculation is:

shrinker_add_scan()

    scan_objects = (scanned / total) * objects * ratio

shrink_icache_memory()

    ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

This has the same behavior as upstream, because upstream's
4 / shrinker->seeks = 2, and the above has DEFAULT_SEEKS = SHRINK_FACTOR * 2.
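
To double-check the equivalence, here is a quick userspace calculation with
made-up numbers (my own illustration, not part of the patch below):

/*
 * Quick userspace check with made-up numbers: compare upstream's icache
 * scan count against the proposed shrinker_add_scan() formula, assuming
 * DEFAULT_SEEKS == 2 on the upstream side. Not part of the patch.
 */
#include <stdio.h>

int main(void)
{
        unsigned long long scanned = 1024;      /* LRU pages scanned */
        unsigned long long total = 32768;       /* total LRU pages */
        unsigned long long nr_unused = 100000;  /* unused inodes */
        unsigned long long pressure = 100;      /* sysctl_vfs_cache_pressure */
        unsigned long long seeks = 2;           /* DEFAULT_SEEKS */

        /* upstream: max_pass = nr_unused * pressure / 100,
         * scan = 4 * (scanned / total) * (max_pass / seeks) */
        unsigned long long max_pass = nr_unused * pressure / 100;
        unsigned long long upstream = 4 * scanned * max_pass / seeks / total;

        /* proposed: ratio = DEFAULT_SEEKS * pressure / 100,
         * scan = (scanned / total) * objects * ratio */
        unsigned long long ratio = seeks * pressure / 100;
        unsigned long long proposed = scanned * nr_unused * ratio / total;

        /* both print the same value (6250 with these inputs) */
        printf("upstream=%llu proposed=%llu\n", upstream, proposed);
        return 0;
}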



===============
o move 'ratio' from denominator to numerator
o adapt kvm/mmu_shrink
o SHRINK_FACTOR / 2 (default seek) x 4 (unknown shrink slab modifier)
-> (SHRINK_FACTOR*2) == DEFAULT_SEEKS

---
arch/x86/kvm/mmu.c | 2 +-
mm/vmscan.c | 10 ++--------
2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ae5a038..cea1e92 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2942,7 +2942,7 @@ static int mmu_shrink(struct shrinker *shrink,
}

shrinker_add_scan(&nr_to_scan, scanned, global, cache_count,
- DEFAULT_SEEKS*10);
+ DEFAULT_SEEKS/10);

done:
cache_count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89b593e..2d8e9ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -208,14 +208,8 @@ void shrinker_add_scan(unsigned long *dst,
{
unsigned long long delta;

- /*
- * The constant 4 comes from old code. Who knows why.
- * This could all use a good tune up with some decent
- * benchmarks and numbers.
- */
- delta = (unsigned long long)scanned * objects
- * SHRINK_FACTOR * SHRINK_FACTOR * 4UL;
- do_div(delta, (ratio * total + 1));
+ delta = (unsigned long long)scanned * objects * ratio;
+ do_div(delta, total+ 1);

/*
* Avoid risking looping forever due to too large nr value:
--
1.6.5.2



