From: Christoph Hellwig on
I might sound like a broken record, but if you want to make forward
progress with this, split it into smaller series.

What would be useful for example would be one series each to split
the global inode_lock and dcache_lock, without introducing all the
fancy new locking primitives, per-bucket locks and lru schemes for
a start.

From: Dave Chinner on
On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working

I've got a couple of patches needed to build XFS - the shrinker
merge left some bad fragments - I'll post them in a minute. This
email is for the longest lockdep warning I've ever seen, which
occurred on boot.

Cheers,

Dave.

[ 6.368707] ======================================================
[ 6.369773] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[ 6.370379] 2.6.35-rc5-dgc+ #58
[ 6.370882] ------------------------------------------------------
[ 6.371475] pmcd/2124 [HC0[0]:SC0[1]:HE1:SE0] is trying to acquire:
[ 6.372062] (&sb->s_type->i_lock_key#6){+.+...}, at: [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268]
[ 6.372268] and this task is already holding:
[ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120
[ 6.372268] which would create a new lock dependency:
[ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...} -> (&sb->s_type->i_lock_key#6){+.+...}
[ 6.372268]
[ 6.372268] but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}
[ 6.372268] ... which became SOFTIRQ-irq-safe at:
[ 6.372268] [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0
[ 6.372268] [<ffffffff81795502>] tcp_check_req+0x202/0x440
[ 6.372268] [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0
[ 6.372268] [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0
[ 6.372268] [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0
[ 6.372268] [<ffffffff81771af7>] ip_rcv+0x397/0x760
[ 6.372268] [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330
[ 6.372268] [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0
[ 6.372268] [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0
[ 6.372268] [<ffffffff81084cc2>] __do_softirq+0xd2/0x260
[ 6.372268] [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 6.372268] [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0
[ 6.372268] [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40
[ 6.372268] [<ffffffff8173c59e>] release_sock+0x14e/0x1a0
[ 6.372268] [<ffffffff817a3975>] inet_stream_connect+0x75/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268]
[ 6.372268] to a SOFTIRQ-irq-unsafe lock:
[ 6.372268] (&sb->s_type->i_lock_key#6){+.+...}
[ 6.372268] ... which became SOFTIRQ-irq-unsafe at:
[ 6.372268] ... [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268]
[ 6.372268] other info that might help us debug this:
[ 6.372268]
[ 6.372268] 3 locks held by pmcd/2124:
[ 6.372268] #0: (&p->lock){+.+.+.}, at: [<ffffffff81171dae>] seq_read+0x3e/0x430
[ 6.372268] #1: (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120
[ 6.372268] #2: (clock-AF_INET){++....}, at: [<ffffffff8173b6ae>] sock_i_ino+0x2e/0x70
[ 6.372268]
[ 6.372268] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[ 6.372268] -> (&(&hashinfo->ehash_locks[i])->rlock){+.-...} ops: 3 {
[ 6.372268] HARDIRQ-ON-W at:
[ 6.372268] [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0
[ 6.372268] [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60
[ 6.372268] [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0
[ 6.372268] [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268] IN-SOFTIRQ-W at:
[ 6.372268] [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0
[ 6.372268] [<ffffffff81795502>] tcp_check_req+0x202/0x440
[ 6.372268] [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0
[ 6.372268] [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0
[ 6.372268] [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0
[ 6.372268] [<ffffffff81771af7>] ip_rcv+0x397/0x760
[ 6.372268] [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330
[ 6.372268] [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0
[ 6.372268] [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0
[ 6.372268] [<ffffffff81084cc2>] __do_softirq+0xd2/0x260
[ 6.372268] [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 6.372268] [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0
[ 6.372268] [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40
[ 6.372268] [<ffffffff8173c59e>] release_sock+0x14e/0x1a0
[ 6.372268] [<ffffffff817a3975>] inet_stream_connect+0x75/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268] INITIAL USE at:
[ 6.372268] [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0
[ 6.372268] [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60
[ 6.372268] [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0
[ 6.372268] [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268] }
[ 6.372268] ... key at: [<ffffffff8285ddf8>] __key.47027+0x0/0x8
[ 6.372268] ... acquired at:
[ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430
[ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170
[ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268]
[ 6.372268]
[ 6.372268] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[ 6.372268] -> (&sb->s_type->i_lock_key#6){+.+...} ops: 1185 {
[ 6.372268] HARDIRQ-ON-W at:
[ 6.372268] [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268] SOFTIRQ-ON-W at:
[ 6.372268] [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268] INITIAL USE at:
[ 6.372268] [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268] }
[ 6.372268] ... key at: [<ffffffff81bd5bd8>] sock_fs_type+0x58/0x80
[ 6.372268] ... acquired at:
[ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430
[ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170
[ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268]
[ 6.372268]
[ 6.372268] stack backtrace:
[ 6.372268] Pid: 2124, comm: pmcd Not tainted 2.6.35-rc5-dgc+ #58
[ 6.372268] Call Trace:
[ 6.372268] [<ffffffff810b28d9>] check_usage+0x499/0x4a0
[ 6.372268] [<ffffffff810b24c6>] ? check_usage+0x86/0x4a0
[ 6.372268] [<ffffffff810af729>] ? __bfs+0x129/0x260
[ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[ 6.372268] [<ffffffff81791750>] ? established_get_first+0x60/0x120
[ 6.372268] [<ffffffff8182beb7>] ? _raw_spin_lock_bh+0x67/0x70
[ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430
[ 6.372268] [<ffffffff81171d70>] ? seq_read+0x0/0x430
[ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170
[ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
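
Boiled down, the report says: the ehash bucket lock is SOFTIRQ-safe (it
is taken from the TCP receive softirq), the new per-inode i_lock is
SOFTIRQ-unsafe (it is taken with softirqs enabled), and the
/proc/net/tcp read path (sock_i_ino -> socket_get_id) now takes i_lock
while holding the ehash lock. A minimal sketch of that structure - with
hypothetical lock names, not the real net/ and fs/ code - looks like
this:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(hash_lock); /* stands in for hashinfo->ehash_locks[i] */
static DEFINE_SPINLOCK(obj_lock);  /* stands in for the per-inode i_lock     */

/* Softirq path (think TCP receive): this is what makes hash_lock
 * SOFTIRQ-safe in lockdep's eyes. */
static void softirq_path(void)
{
	spin_lock(&hash_lock);
	/* ... insert into the hash ... */
	spin_unlock(&hash_lock);
}

/* Ordinary process-context user of obj_lock, softirqs enabled:
 * this is what makes obj_lock SOFTIRQ-unsafe. */
static void process_path_a(void)
{
	spin_lock(&obj_lock);
	/* ... update the object ... */
	spin_unlock(&obj_lock);
}

/* The /proc/net/tcp-style path: takes obj_lock while holding hash_lock,
 * creating the hash_lock -> obj_lock dependency lockdep flags above. */
static void process_path_b(void)
{
	spin_lock_bh(&hash_lock);
	spin_lock(&obj_lock);
	/* ... read the inode number ... */
	spin_unlock(&obj_lock);
	spin_unlock_bh(&hash_lock);
}

/* Deadlock scenario: CPU0 runs process_path_a() and holds obj_lock with
 * softirqs enabled; a softirq arrives on CPU0 and spins on hash_lock in
 * softirq_path(); meanwhile CPU1 is in process_path_b(), holding
 * hash_lock and spinning on obj_lock, which CPU0 cannot drop until the
 * softirq returns.  Hence "SOFTIRQ-safe -> SOFTIRQ-unsafe lock order
 * detected". */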

--
Dave Chinner
david@fromorbit.com
From: Dave Chinner on
On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working

Bugs I've noticed so far:

- Using XFS, the existing vfs inode count statistic does not decrease
as inodes are freed.
- the existing vfs dentry count remains at zero
- the existing vfs free inode count remains at zero

$ pminfo -f vfs.inodes vfs.dentry

vfs.inodes.count
value 7472612

vfs.inodes.free
value 0

vfs.dentry.count
value 0

vfs.dentry.free
value 0
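
A quick way to cross-check this independently of PCP - assuming, as I
believe, that the vfs.* metrics are derived from the procfs counters
below - is to read /proc/sys/fs/inode-nr and /proc/sys/fs/dentry-state
directly; if those have stopped updating too, the problem is in the
kernel counters rather than in the monitoring:

#include <stdio.h>

static void dump(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%-26s %s", path, buf);
	fclose(f);
}

int main(void)
{
	dump("/proc/sys/fs/inode-nr");     /* nr_inodes nr_free_inodes */
	dump("/proc/sys/fs/dentry-state"); /* nr_dentry nr_unused age_limit ... */
	return 0;
}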


Performance Summary:

With lockdep and CONFIG_XFS_DEBUG enabled, a 16 thread parallel
sequential create/unlink workload on an 8p/4GB RAM VM with a virtio
block device sitting on a short-stroked 12x2TB SAS array w/ 512MB
BBWC in RAID0 via dm and using the noop elevator in the guest VM:

$ sudo mkfs.xfs -f -l size=128m -d agcount=16 /dev/vdb
meta-data=/dev/vdb               isize=256    agcount=16, agsize=1638400 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ cd ~/src/fs_mark-3.3/
$ ./fs_mark -S0 -n 500000 -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15

                     files/s
2.6.34-rc4            12550
2.6.35-rc5+scale      12285

So it's the same within the error margins of the benchmark.

Screenshots of the monitoring graphs - you can see the effect of the
broken stats:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc5-npiggin-scale-lockdep-16x500-xfs.png

With a production build (i.e. no lockdep, no xfs debug), I'll
run the same fs_mark parallel create/unlink workload to show
scalability as I ran here:

http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

The numbers can't be directly compared, but the test and the setup are
the same. The XFS numbers below are with delayed logging enabled. ext4
is using default mkfs and mount parameters except for barrier=0. All
numbers are averages of three runs.

fs_mark rate (thousands of files/second)
                2.6.35-rc5          2.6.35-rc5-scale
threads        xfs     ext4         xfs     ext4
      1         20       39          20       39
      2         35       55          35       57
      4         60       41          57       42
      8         79        9          75        9

ext4 is getting IO bound at more than 2 threads, so apart from
pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
going to ignore ext4 for the purposes of testing scalability here.

For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
CPU, and with Nick's patches it's about 650% (10% higher) for slightly
lower throughput. So on this class of machine, for this workload, the
changes result in a slight reduction in scalability.

I looked at dbench on XFS as well, but didn't see any significant
change in the numbers at up to 200 load threads, so not much to
talk about there.

Sometime over the weekend I'll build a 16p VM and see what I get
from that...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
From: Nick Piggin on
On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

> Summary of a few numbers I've run. google's socket teardown workload
> runs 3-4x faster on my 2 socket Opteron. Single thread git diff runs 20%
> faster on the same machine. 32 node Altix runs dbench on ramfs 150x faster (100MB/s
> up to 15GB/s).

Following post just contains some preliminary benchmark numbers on a
POWER7. Boring if you're not interested in this stuff.

IBM and Mikey kindly allowed me to do some test runs on a big POWER7
system today. "Very" is the only word I'm authorized to use to describe
how big it is. We tested the vfs-scale-working and master branches from
my git tree as of today. I'll stick with relative numbers to be safe.
All tests were run on ramfs.


First, and very important, is single-threaded performance of basic
code. POWER7 is obviously vastly different from a Barcelona or Nehalem,
and store-free path walk uses a lot of seqlocks, which are cheap on x86
and a little more expensive on others.

Test case           time difference, vanilla to vfs-scale (negative is better)
stat()                  -10.8% +/- 0.3%
close(open())            +4.3% +/- 0.3%
unlink(creat())         +36.8% +/- 0.3%

stat is significantly faster, which is really good.

open/close is a bit slower, which we didn't get time to analyse. There
are one or two seqlock checks which might be avoided, which could make
up the difference. It's not horrible, but I hope to get POWER7
open/close more competitive (on x86, open/close is even a bit faster).

Note this is a worst case for rcu-path-walk: a lookup of "./file",
because it has to take a refcount on the final element. With more path
elements, rcu walk should gain the advantage.

creat/unlink is showing the big RCU penalty. However, I have penciled
out a working design with Linus for how to do SLAB_DESTROY_BY_RCU. It
makes the store-free path walking and some inode RCU list walking a
little bit trickier, though, so I prefer not to pile too much on at
once. There is something that can be done if regressions show up. I
don't anticipate many regressions outside microbenchmarks, and this is
about the absolute worst case.
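
The actual harness isn't included here, but the kind of single-threaded
loop being timed is roughly the following sketch (the iteration count,
file names and output format are assumptions; run it on ramfs):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <time.h>

#define ITERS 1000000

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct stat st;
	double t;
	int i;

	close(creat("./file", 0644));	/* target for stat/open below */

	/* stat() on a short path: the rcu-path-walk worst case above */
	t = now();
	for (i = 0; i < ITERS; i++)
		stat("./file", &st);
	printf("stat():          %.0f ns/op\n", (now() - t) * 1e9 / ITERS);

	t = now();
	for (i = 0; i < ITERS; i++)
		close(open("./file", O_RDONLY));
	printf("close(open()):   %.0f ns/op\n", (now() - t) * 1e9 / ITERS);

	t = now();
	for (i = 0; i < ITERS; i++) {
		close(creat("./tmp", 0644));
		unlink("./tmp");
	}
	printf("unlink(creat()): %.0f ns/op\n", (now() - t) * 1e9 / ITERS);

	unlink("./file");
	return 0;
}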


On to parallel tests. Firstly, the google socket workload.
Running with "NR_THREADS" children, vfs-scale patches do this:

root@p7ih06:~/google# time ./google --files_per_cpu 10000 > /dev/null
real 0m4.976s
user 8m38.925s
sys 6m45.236s

root@p7ih06:~/google# time ./google --files_per_cpu 20000 > /dev/null
real 0m7.816s
user 11m21.034s
sys 14m38.258s

root@p7ih06:~/google# time ./google --files_per_cpu 40000 > /dev/null
real 0m11.358s
user 11m37.955s
sys 28m44.911s

Reducing to NR_THREADS/4 children allows vanilla to complete:

root@p7ih06:~/google# time ./google --files_per_cpu 10000
real 1m23.118s
user 3m31.820s
sys 81m10.405s

I was actually surprised it did that well.


Dbench was an interesting one. We didn't manage to stretch the box's
legs, unfortunately! dbench with 1 proc gave about 500MB/s, 64 procs
gave 21GB/s, and at 128 procs throughput dropped dramatically. It turns
out that weird things start happening with the rename seqlock versus
d_lookup, and with d_move contention (dbench does a sprinkle of
renaming). That can be improved, I think, but it's not worth bothering
with for the time being.

It's not really worth testing vanilla at high dbench parallelism.


The parallel git diff workload looked OK. It seemed to be scaling fine
in the vfs, but it hit a bottleneck in powerpc's tlb invalidation, so
the numbers may not be so interesting.


Lastly, some parallel syscall microbenchmarks:

procs             vanilla       vfs-scale
open-close, separate-cwd
1               384557.70       355923.82   op/s/proc
NR_CORES            86.63       164054.64   op/s/proc
NR_THREADS          18.68 (ouch!)

open-close, same-cwd
1               381074.32       339161.25
NR_CORES           104.16       107653.05

creat-unlink, separate-cwd
1               145891.05       104301.06
NR_CORES            29.81        10061.66

creat-unlink, same-cwd
1               129681.27       104301.06
NR_CORES            12.68          181.24

So we can see the single thread performance regressions here, but
the vanilla case really chokes at high CPU counts.
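
The parallel harness isn't posted either, but the structure is roughly:
fork one child per proc, give each child either its own working
directory ("separate-cwd") or a shared one ("same-cwd"), and time a
tight open/close loop in each child. A hedged sketch (directory names,
iteration count and reporting are assumptions); invoked as, say,
"./openclose 64 1" for the separate-cwd case at 64 procs:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <time.h>

#define ITERS 200000

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 1;	/* 1, NR_CORES, NR_THREADS */
	int separate_cwd = argc > 2 && atoi(argv[2]);	/* 1 = separate-cwd */
	int p, i;

	for (p = 0; p < nproc; p++) {
		if (fork() == 0) {
			double t;

			if (separate_cwd) {
				char dir[32];

				/* each child gets its own directory, so the
				 * parent dentry/inode locks aren't shared */
				snprintf(dir, sizeof(dir), "d%d", p);
				mkdir(dir, 0755);
				chdir(dir);
			}
			close(creat("file", 0644));

			t = now();
			for (i = 0; i < ITERS; i++)
				close(open("file", O_RDONLY));
			printf("proc %d: %.2f op/s/proc\n", p,
			       ITERS / (now() - t));
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}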

From: Nick Piggin on
On Fri, Jul 23, 2010 at 07:17:46AM -0400, Christoph Hellwig wrote:
> I might sound like a broken record, but if you want to make forward
> progress with this, split it into smaller series.

No, I appreciate the advice. I put this tree up for people to fetch
without posting patches all the time. I think it is important to
test and to see the big picture when reviewing the patches, but you
are right about how to actually submit patches on the ML.


> What would be useful for example would be one series each to split
> the global inode_lock and dcache_lock, without introducing all the
> fancy new locking primitives, per-bucket locks and lru schemes for
> a start.

I've kept the series fairly well structured like that. Basically it
is in these parts:

1. files lock
2. vfsmount lock
3. mnt refcount
4a. put several new global spinlocks around different parts of dcache
4b. remove dcache_lock after the above protect everything
4c. start doing fine grained locking of hash, inode alias, lru, etc etc
5a, 5b, 5c. same for inodes
6. some further optimisations and cleanups
7. store-free path walking

This kind of sequence. I will again try to submit a first couple of
things to Al soon.
