|
Prev: on CONFIG_MM_OWNER=y, kernel panic is possible.
Next: on CONFIG_MM_OWNER=y, kernel panic is possible.
From: Matthew Wilcox on 6 May 2008 12:50 On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote: > On Tue, 6 May 2008, Matthew Wilcox wrote: > > I've wanted to fix file locks for a while. Here's a first attempt. > > It was done quickly, so I concede that it may well have bugs in it. > > I found (and fixed) one with LTP. > > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is > there some user that doesn't have the inode info, or does anything that > might cross inode boundaries? /proc/locks and deadlock detection both cross inode boundaries (and even filesystem boundaries). The BKL-removal brigade tried this back in 2.4 and the locking ended up scaling worse than just plonking a single spinlock around the whole thing. > This does seem to drop all locking around the "setlease()" calls down to > the filesystem, which worries me. That said, we clearly do need to do > this. Probably should have done it a long time ago. The only filesystems that are going to have their own setlease methods will be remote ones (nfs, smbfs, etc). They're going to need to sleep while the server responds to them. So holding a spinlock while we call them is impolite at best. > Also, why do people do this: > > > -find_conflict: > > + find_conflict: > > Hmm? So that find_conflict doesn't end up in the first column, which causes diff to treat it as a function name for the purposes of the @@ lines. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: J. Bruce Fields on 6 May 2008 12:50 On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote: > > > On Tue, 6 May 2008, Matthew Wilcox wrote: > > > > I've wanted to fix file locks for a while. Here's a first attempt. > > It was done quickly, so I concede that it may well have bugs in it. > > I found (and fixed) one with LTP. > > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is > there some user that doesn't have the inode info, or does anything that > might cross inode boundaries? The deadlock detection crosses inode boundaries. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Matthew Wilcox on 6 May 2008 13:40 On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote: > Do we actually know that the locks code is implicated in this regression? Not yet. We don't even know it's the BKL. It's just my best guess. We're waiting for the original reporter to run some tests Ingo pointed him at. > I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek() > or remote_llseek(). Correct. > Finally: how come we regressed by swapping the semaphore implementation > anyway? We went from one sleeping lock implementation to another - I'd > have expected performance to be pretty much the same. > > <looks at the implementation> > > down(), down_interruptible() and down_try() should use spin_lock_irq(), not > irqsave. We talked about this ... the BKL actually requires that you be able to acquire it with interrupts disabled. Maybe we should make lock_kernel do this: if (likely(!depth)) { unsigned long flags; local_save_flags(flags); down(); local_irq_restore(flags); } But tweaking down() is not worth it -- we should be eliminating users of both the BKL and semaphores instead. > up() seems to be doing wake-one, FIFO which is nice. Did the > implementation which we just removed also do that? Was it perhaps > accidentally doing LIFO or something like that? That's a question for someone who knows x86 assembler, I think. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Ingo Molnar on 6 May 2008 13:50 * Andrew Morton <akpm(a)linux-foundation.org> wrote: > Finally: how come we regressed by swapping the semaphore > implementation anyway? We went from one sleeping lock implementation > to another - I'd have expected performance to be pretty much the same. > > <looks at the implementation> > > down(), down_interruptible() and down_try() should use > spin_lock_irq(), not irqsave. > > up() seems to be doing wake-one, FIFO which is nice. Did the > implementation which we just removed also do that? Was it perhaps > accidentally doing LIFO or something like that? i just checked the old implementation on x86. It used lib/semaphore-sleepers.c which does one weird thing: - __down() when it returns wakes up yet another task via wake_up_locked(). i.e. we'll always keep yet another task in flight. This can mask wakeup latencies especially when it takes time. The patch (hack) below tries to emulate this weirdness - it 'kicks' another task as well and keeps it busy. Most of the time this just causes extra scheduling, but if AIM7 is _just_ saturating the number of CPUs, it might make a difference. Yanmin, does the patch below make any difference to the AIM7 results? ( it would be useful data to get a meaningful context switch trace from the whole regressed workload, and compare it to a context switch trace with the revert added. ) Ingo --- kernel/semaphore.c | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux/kernel/semaphore.c =================================================================== --- linux.orig/kernel/semaphore.c +++ linux/kernel/semaphore.c @@ -261,4 +261,14 @@ static noinline void __sched __up(struct list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); + + if (likely(list_empty(&sem->wait_list))) + return; + /* + * Opportunistically wake up another task as well but do not + * remove it from the list: + */ + waiter = list_first_entry(&sem->wait_list, + struct semaphore_waiter, list); + wake_up_process(waiter->task); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Zhang, Yanmin on 6 May 2008 23:50
On Wed, 2008-05-07 at 10:11 +0800, Zhang, Yanmin wrote: > On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote: > > * Zhang, Yanmin <yanmin_zhang(a)linux.intel.com> wrote: > > > > > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with > > > 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, and Itanium > > > Montecito. Bisect located below patch. > > > > > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit > > > commit 64ac24e738823161693bf791f87adc802cf529ff > > > Author: Matthew Wilcox <matthew(a)wil.cx> > > > Date: Fri Mar 7 21:55:58 2008 -0500 > > > > > > Generic semaphore implementation > > > > > > After I manually reverted the patch against 2.6.26-rc1 while fixing > > > lots of conflictions/errors, aim7 regression became less than 2%. > > > > hm, which exact semaphore would that be due to? > > > > My first blind guess would be the BKL - there's not much other semaphore > > use left in the core kernel otherwise that would affect AIM7 normally. > > The VFS still makes frequent use of the BKL and AIM7 is very VFS > > intense. Getting rid of that BKL use from the VFS might be useful to > > performance anyway. > > > > Could you try to check that it's indeed the BKL? > > > > Easiest way to check it would be to run AIM7 it on > > sched-devel.git/latest and do scheduler tracing via: > > > > http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt > One clear weird behavior of aim7 is cpu idle is 0% with 2.6.25, but is more than 50% with > 2.6.26-rc1. I have a patch to collect schedule info. With my patch+gprof, I collected some data. Below was outputed by gprof. index % time self children called name 0.00 0.00 2/223305376 __down_write_nested [22749] 0.00 0.00 3/223305376 journal_commit_transaction [10526] 0.00 0.00 6/223305376 __down_read [22745] 0.00 0.00 8/223305376 start_this_handle [19167] 0.00 0.00 15/223305376 sys_pause [19808] 0.00 0.00 17/223305376 log_wait_commit [11047] 0.00 0.00 20/223305376 futex_wait [8122] 0.00 0.00 64/223305376 pdflush [14335] 0.00 0.00 71/223305376 do_get_write_access [5367] 0.00 0.00 84/223305376 pipe_wait [14460] 0.00 0.00 111/223305376 kjournald [10726] 0.00 0.00 116/223305376 int_careful [9634] 0.00 0.00 224/223305376 do_nanosleep [5418] 0.00 0.00 1152/223305376 watchdog [22065] 0.00 0.00 4087/223305376 worker_thread [22076] 0.00 0.00 5003/223305376 __mutex_lock_killable_slowpath [23305] 0.00 0.00 7810/223305376 ksoftirqd [10831] 0.00 0.00 9389/223305376 __mutex_lock_slowpath [23306] 0.00 0.00 10642/223305376 io_schedule [9813] 0.00 0.00 23544/223305376 migration_thread [11495] 0.00 0.00 35319/223305376 __cond_resched [22673] 0.00 0.00 49065/223305376 retint_careful [16146] 0.00 0.00 119757/223305376 sysret_careful [20074] 0.00 0.00 151717/223305376 do_wait [5545] 0.00 0.00 250221/223305376 do_exit [5356] 0.00 0.00 303836/223305376 cpu_idle [4350] 0.00 0.00 222333093/223305376 schedule_timeout [2] [1] 0.0 0.00 0.00 223305376 schedule [1] ----------------------------------------------- 0.00 0.00 2/222333093 io_schedule_timeout [9814] 0.00 0.00 4/222333093 journal_stop [10588] 0.00 0.00 8/222333093 cifs_oplock_thread [3760] 0.00 0.00 14/222333093 do_sys_poll [5513] 0.00 0.00 20/222333093 cifs_dnotify_thread [3733] 0.00 0.00 32/222333093 read_chan [15648] 0.00 0.00 47/222333093 wait_for_common [22017] 0.00 0.00 658/222333093 do_select [5479] 0.00 0.00 2000/222333093 inet_stream_connect [9324] 0.00 0.00 222330308/222333093 __down [22577] [2] 0.0 0.00 0.00 222333093 schedule_timeout [2] 0.00 0.00 222333093/223305376 schedule [1] ----------------------------------------------- 0.00 0.00 1/165565 flock_lock_file_wait [7735] 0.00 0.00 7/165565 __posix_lock_file [23371] 0.00 0.00 203/165565 de_put [4665] 0.00 0.00 243/165565 opost [13633] 0.00 0.00 333/165565 proc_root_readdir [14982] 0.00 0.00 358/165565 write_chan [22090] 0.00 0.00 6222/165565 proc_lookup_de [14908] 0.00 0.00 32081/165565 sys_fcntl [19687] 0.00 0.00 36045/165565 vfs_ioctl [21822] 0.00 0.00 42025/165565 tty_release [20818] 0.00 0.00 48047/165565 chrdev_open [3702] [3] 0.0 0.00 0.00 165565 lock_kernel [3] 0.00 0.00 152987/153190 down [4] ----------------------------------------------- 0.00 0.00 203/153190 __reacquire_kernel_lock [23420] 0.00 0.00 152987/153190 lock_kernel [3] [4] 0.0 0.00 0.00 153190 down [4] 0.00 0.00 153190/153190 __down [22577] ----------------------------------------------- 0.00 0.00 153190/153190 down [4] [22577 0.0 0.00 0.00 153190 __down [22577] 0.00 0.00 222330308/222333093 schedule_timeout [2] As system idle is more than 50%, so the schedule/schedule_timeout caller is important information. 1) lock_kernel causes most schedule/schedule_timeout; 2) When lock_kernel calls down, then __down, __down calls schedule_timeout for lots of times in a loop; 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. -yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |