From: Trond Myklebust on
On Sun, 2009-12-13 at 11:19 -0800, Linus Torvalds wrote:
>
> On Sun, 13 Dec 2009, Trond Myklebust wrote:
>
> > On Sun, 2009-12-13 at 20:04 +0100, Frederic Weisbecker wrote:
> > > In the above cases we have the following comment:
> > >
> > > /* Protect inode->i_flock using the BKL */
> > >
> > > And really it doesn't seem to protect anything else,
> > > fortunately it is acquired in a short path.
> >
> > As I said in my reply, this is the tough one, because the BKL protection
> > is imposed by the VFS locking scheme used in fs/locks.c.
> >
> > There is a similar dependency imposed upon fs/lockd/
>
> Note that since NFS seems to be the only one who really cares, I think the
> appropriate course of action is to just replace the BKL - preferably with
> a spinlock that you just drop before you do anything that blocks.
>
> Not only does the BKL already do that (so the locking doesn't change), but
> I think most _users_ of the BKL actually already do the explicit dropping
> of the lock (rather than the implicit one done by schedule()) because it's
> already been a scalability issue and we've had some history of trying
> alternative approaches that didn't do that whole auto-dropping anyway
> (whether those alternate approaches be semaphores or spinlocks).
>
> So don't worry about the "imposed by the VFS" thing. I think you can
> fairly easily change the VFS side.

That comment applies to the NFSv4 code, which applies the BKL only
because we need to traverse the inode->i_flock list. There is no
recursive use of the BKL there, and we can trivially replace it with
whichever new lock that that is chosen to replace the BKL in fs/locks.c.

To fix the lockd module, we have to backport some of the lessons we
learned when writing the NFSv4 file locking code. It won't be a huge
job, but I'd feel uncomfortable if I were to do it in a hurry just in
order to meet the merge window deadline.

Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arjan van de Ven on
On Sun, 13 Dec 2009 20:16:08 +0100
Frederic Weisbecker <fweisbec(a)gmail.com> wrote:

> On Sun, Dec 13, 2009 at 09:55:34AM -0800, Arjan van de Ven wrote:
> > On Sun, 13 Dec 2009 07:58:44 +0100
> > Ingo Molnar <mingo(a)elte.hu> wrote:
> >
> > >
> > > * Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
> > >
> > > > We've had quite a bit of BKL work this merge-window. Maybe we'll
> > > > even get rid of it one of these days. There are "only" about 600
> > > > instances of "lock_kernel()" in the tree right now ;)
> > >
> > > I tend to use unlock_kernel() as the metric. (as it's more
> > > precisely greppable and it is also more indicative of the
> > > underlying complexity of locking, as it gets used more in more
> > > complex scenarios)
> >
> > another metric is... how many times do we take the BKL for some
> > workload. (For example booting or compiling a kernel).
> > A counter like "BKLs-per-second" would be nice to expose
> > (and then we can track that number going up as a regression etc)
>
>
>
> We have the bkl tracepoints for that, attaching an example below,
> blkdev_get/bkldev_put is among the highest consumer for me.

we have a trace, but not a number that anyone can just pull out without
having to go through great lengths to set stuff up... (esp to capture a
boot)...
Adding a counter always to the lock_kernel function should be fine
instead...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Oliver Neukum on
Am Montag, 14. Dezember 2009 06:30:15 schrieb Arjan van de Ven:
> > We have the bkl tracepoints for that, attaching an example below,
> > blkdev_get/bkldev_put is among the highest consumer for me.
>
> we have a trace, but not a number that anyone can just pull out without
> having to go through great lengths to set stuff up... (esp to capture a
> boot)...
> Adding a counter always to the lock_kernel function should be fine
> instead...

But don't we need to know how long it is held? If you just count
you'll make code that drops it while it can look bad.

Regards
Oliver
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arjan van de Ven on
On Mon, 14 Dec 2009 11:39:48 +0100
Oliver Neukum <oliver(a)neukum.org> wrote:

> Am Montag, 14. Dezember 2009 06:30:15 schrieb Arjan van de Ven:
> > > We have the bkl tracepoints for that, attaching an example below,
> > > blkdev_get/bkldev_put is among the highest consumer for me.
> >
> > we have a trace, but not a number that anyone can just pull out
> > without having to go through great lengths to set stuff up... (esp
> > to capture a boot)...
> > Adding a counter always to the lock_kernel function should be fine
> > instead...
>
> But don't we need to know how long it is held? If you just count
> you'll make code that drops it while it can look bad.

I think/hope we're well past that. At this point, just the act of taking
it at all is bad.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arnd Bergmann on
On Monday 14 December 2009 05:30:15 Arjan van de Ven wrote:
> On Sun, 13 Dec 2009 20:16:08 +0100 Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
>
> > We have the bkl tracepoints for that, attaching an example below,
> > blkdev_get/bkldev_put is among the highest consumer for me.
>
> we have a trace, but not a number that anyone can just pull out without
> having to go through great lengths to set stuff up... (esp to capture a
> boot)...
> Adding a counter always to the lock_kernel function should be fine
> instead...

FWIW, I've hacked up some debugging logic in lib/kernel_lock.c to count
and measure the BKL. If anyone is interested in this, I can bring the
patch into a form that is usable. The output for a boot followed by
building a kernel on a four-way box with today's git looks like:

fs/ioctl.c:51 vfs_ioctl,5672809,0,9,108968293432,40528
drivers/char/vt_ioctl.c:512 vt_ioctl,15615,0,45,7967944,16400928
fs/locks.c:826 __posix_lock_file,1272,0,57,2815752,2013920
sound/core/sound.c:175 snd_open,1201,0,0,3863432,0
fs/block_dev.c:1192 __blkdev_get,479,26,14,39860472,686636416
drivers/usb/core/devio.c:656 usbdev_open,240,0,0,1640792,0
drivers/usb/core/devio.c:1868 usbdev_ioctl,240,0,0,369368,0
fs/read_write.c:110 default_llseek,185,0,5,88048,669224
drivers/char/tty_io.c:1761 tty_open,136,0,11,2878344,249472048
fs/block_dev.c:1358 __blkdev_put,117,1,1,10654208,3840
drivers/char/tty_io.c:1882 tty_open,106,0,5,44344,32056
drivers/char/tty_io.c:1603 tty_release,99,0,3,363496,8192
drivers/char/tty_io.c:1513 tty_release,99,0,3,639200,104312
kernel/ptrace.c:610 sys_ptrace,98,0,0,588432,0
block/ioctl.c:171 __blkdev_driver_ioctl,40,0,4,2960544,69288
fs/locks.c:1188 __break_lease,25,0,0,14960,0
fs/namespace.c:1650 do_new_mount,14,1,0,403088,0
fs/locks.c:733 flock_lock_file,14,0,0,33072,0
fs/locks.c:2022 locks_remove_flock,9,0,0,17160,0
drivers/char/pty.c:673 ptmx_open,8,0,0,406200,0
drivers/input/input.c:1767 input_open_file,7,0,0,83776,0
drivers/char/tty_io.c:717 disassociate_ctty,5,0,1,2376,322352
lib/kernel_lock.c:236 stat_read,3,0,0,590328,0
kernel/trace/trace.c:721 register_tracer,3,0,1,2420999112,17296
fs/locks.c:647 posix_test_lock,3,0,0,6920,0
drivers/gpu/drm/drm_fops.c:176 drm_stub_open,2,0,0,16712,0
init/main.c:849 kernel_init,1,0,0,334951112,0
init/main.c:546 start_kernel,1,0,0,873891904,0
fs/ext3/super.c:2548 ext3_remount,1,0,0,86592,0
fs/ext3/super.c:2036 ext3_fill_super,1,1,0,2656,0
drivers/gpu/drm/drm_fops.c:473 drm_release,1,0,0,6360,0
block/ioctl.c:316 blkdev_ioctl,1,0,1,848,32352

The comma-separated fields are
1. caller location
2. number of lock_kernel() calls
3. number of times it was called with BKL held already
4. number of times it blocked
5. total time it was held in get_cycles() units, not counting
time spent during release-on-sleep
6. total time that this lock_kernel() was blocked by another
thread holding the BKL.

It would be easy to add some recording of who is nesting
or blocking whom, or which ones call a function that might_sleep().

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/