|
Prev: on CONFIG_MM_OWNER=y, kernel panic is possible.
Next: on CONFIG_MM_OWNER=y, kernel panic is possible.
From: Linus Torvalds on 7 May 2008 11:30 On Wed, 7 May 2008, Linus Torvalds wrote: > > But my preferred option would indeed be just turning it back into a > spinlock - and screw latency and BKL preemption - and having the RT people > who care deeply just work on removing the BKL in the long run. Here's a trial balloon patch to do that. Yanmin - this is not well tested, but the code is fairly obvious, and it would be interesting to hear if this fixes the performance regression. Because if it doesn't, then it's not the BKL, or something totally different is going on. Now, we should probably also test just converting the thing to a mutex, to see if that perhaps also fixes it. Linus --- arch/mn10300/Kconfig | 11 ---- include/linux/hardirq.h | 18 ++++--- kernel/sched.c | 27 ++--------- lib/kernel_lock.c | 120 +++++++++++++++++++++++++++++++--------------- 4 files changed, 95 insertions(+), 81 deletions(-) diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig index 6a6409a..e856218 100644 --- a/arch/mn10300/Kconfig +++ b/arch/mn10300/Kconfig @@ -186,17 +186,6 @@ config PREEMPT Say Y here if you are building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. -config PREEMPT_BKL - bool "Preempt The Big Kernel Lock" - depends on PREEMPT - default y - help - This option reduces the latency of the kernel by making the - big kernel lock preemptible. - - Say Y here if you are building a kernel for a desktop system. - Say N if you are unsure. - config MN10300_CURRENT_IN_E2 bool "Hold current task address in E2 register" default y diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 897f723..181006c 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -72,6 +72,14 @@ #define in_softirq() (softirq_count()) #define in_interrupt() (irq_count()) +#if defined(CONFIG_PREEMPT) +# define PREEMPT_INATOMIC_BASE kernel_locked() +# define PREEMPT_CHECK_OFFSET 1 +#else +# define PREEMPT_INATOMIC_BASE 0 +# define PREEMPT_CHECK_OFFSET 0 +#endif + /* * Are we running in atomic context? WARNING: this macro cannot * always detect atomic context; in particular, it cannot know about @@ -79,17 +87,11 @@ * used in the general case to determine whether sleeping is possible. * Do not use in_atomic() in driver code. */ -#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) - -#ifdef CONFIG_PREEMPT -# define PREEMPT_CHECK_OFFSET 1 -#else -# define PREEMPT_CHECK_OFFSET 0 -#endif +#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE) /* * Check whether we were atomic before we did preempt_disable(): - * (used by the scheduler) + * (used by the scheduler, *after* releasing the kernel lock) */ #define in_atomic_preempt_off() \ ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET) diff --git a/kernel/sched.c b/kernel/sched.c index 58fb8af..c51b656 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule); asmlinkage void __sched preempt_schedule(void) { struct thread_info *ti = current_thread_info(); - struct task_struct *task = current; - int saved_lock_depth; /* * If there is a non-zero preempt_count or interrupts are disabled, @@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void) do { add_preempt_count(PREEMPT_ACTIVE); - - /* - * We keep the big kernel semaphore locked, but we - * clear ->lock_depth so that schedule() doesnt - * auto-release the semaphore: - */ - saved_lock_depth = task->lock_depth; - task->lock_depth = -1; schedule(); - task->lock_depth = saved_lock_depth; sub_preempt_count(PREEMPT_ACTIVE); /* @@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule); asmlinkage void __sched preempt_schedule_irq(void) { struct thread_info *ti = current_thread_info(); - struct task_struct *task = current; - int saved_lock_depth; /* Catch callers which need to be fixed */ BUG_ON(ti->preempt_count || !irqs_disabled()); do { add_preempt_count(PREEMPT_ACTIVE); - - /* - * We keep the big kernel semaphore locked, but we - * clear ->lock_depth so that schedule() doesnt - * auto-release the semaphore: - */ - saved_lock_depth = task->lock_depth; - task->lock_depth = -1; local_irq_enable(); schedule(); local_irq_disable(); - task->lock_depth = saved_lock_depth; sub_preempt_count(PREEMPT_ACTIVE); /* @@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) spin_unlock_irqrestore(&rq->lock, flags); /* Set the preempt count _outside_ the spinlocks! */ +#if defined(CONFIG_PREEMPT) + task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0); +#else task_thread_info(idle)->preempt_count = 0; - +#endif /* * The idle tasks have their own, simple scheduling class: */ diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c index cd3e825..06722aa 100644 --- a/lib/kernel_lock.c +++ b/lib/kernel_lock.c @@ -11,79 +11,121 @@ #include <linux/semaphore.h> /* - * The 'big kernel semaphore' + * The 'big kernel lock' * - * This mutex is taken and released recursively by lock_kernel() + * This spinlock is taken and released recursively by lock_kernel() * and unlock_kernel(). It is transparently dropped and reacquired * over schedule(). It is used to protect legacy code that hasn't * been migrated to a proper locking design yet. * - * Note: code locked by this semaphore will only be serialized against - * other code using the same locking facility. The code guarantees that - * the task remains on the same CPU. - * * Don't use in new code. */ -static DECLARE_MUTEX(kernel_sem); +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag); + /* - * Re-acquire the kernel semaphore. + * Acquire/release the underlying lock from the scheduler. * - * This function is called with preemption off. + * This is called with preemption disabled, and should + * return an error value if it cannot get the lock and + * TIF_NEED_RESCHED gets set. * - * We are executing in schedule() so the code must be extremely careful - * about recursion, both due to the down() and due to the enabling of - * preemption. schedule() will re-check the preemption flag after - * reacquiring the semaphore. + * If it successfully gets the lock, it should increment + * the preemption count like any spinlock does. + * + * (This works on UP too - _raw_spin_trylock will never + * return false in that case) */ int __lockfunc __reacquire_kernel_lock(void) { - struct task_struct *task = current; - int saved_lock_depth = task->lock_depth; - - BUG_ON(saved_lock_depth < 0); - - task->lock_depth = -1; - preempt_enable_no_resched(); - - down(&kernel_sem); - + while (!_raw_spin_trylock(&kernel_flag)) { + if (test_thread_flag(TIF_NEED_RESCHED)) + return -EAGAIN; + cpu_relax(); + } preempt_disable(); - task->lock_depth = saved_lock_depth; - return 0; } void __lockfunc __release_kernel_lock(void) { - up(&kernel_sem); + _raw_spin_unlock(&kernel_flag); + preempt_enable_no_resched(); } /* - * Getting the big kernel semaphore. + * These are the BKL spinlocks - we try to be polite about preemption. + * If SMP is not on (ie UP preemption), this all goes away because the + * _raw_spin_trylock() will always succeed. */ -void __lockfunc lock_kernel(void) +#ifdef CONFIG_PREEMPT +static inline void __lock_kernel(void) { - struct task_struct *task = current; - int depth = task->lock_depth + 1; + preempt_disable(); + if (unlikely(!_raw_spin_trylock(&kernel_flag))) { + /* + * If preemption was disabled even before this + * was called, there's nothing we can be polite + * about - just spin. + */ + if (preempt_count() > 1) { + _raw_spin_lock(&kernel_flag); + return; + } - if (likely(!depth)) /* - * No recursion worries - we set up lock_depth _after_ + * Otherwise, let's wait for the kernel lock + * with preemption enabled.. */ - down(&kernel_sem); + do { + preempt_enable(); + while (spin_is_locked(&kernel_flag)) + cpu_relax(); + preempt_disable(); + } while (!_raw_spin_trylock(&kernel_flag)); + } +} - task->lock_depth = depth; +#else + +/* + * Non-preemption case - just get the spinlock + */ +static inline void __lock_kernel(void) +{ + _raw_spin_lock(&kernel_flag); } +#endif -void __lockfunc unlock_kernel(void) +static inline void __unlock_kernel(void) { - struct task_struct *task = current; + /* + * the BKL is not covered by lockdep, so we open-code the + * unlocking sequence (and thus avoid the dep-chain ops): + */ + _raw_spin_unlock(&kernel_flag); + preempt_enable(); +} - BUG_ON(task->lock_depth < 0); +/* + * Getting the big kernel lock. + * + * This cannot happen asynchronously, so we only need to + * worry about other CPU's. + */ +void __lockfunc lock_kernel(void) +{ + int depth = current->lock_depth+1; + if (likely(!depth)) + __lock_kernel(); + current->lock_depth = depth; +} - if (likely(--task->lock_depth < 0)) - up(&kernel_sem); +void __lockfunc unlock_kernel(void) +{ + BUG_ON(current->lock_depth < 0); + if (likely(--current->lock_depth < 0)) + __unlock_kernel(); } EXPORT_SYMBOL(lock_kernel); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Ingo Molnar on 7 May 2008 12:30 * Linus Torvalds <torvalds(a)linux-foundation.org> wrote: > I think turning the BKL into a semaphore was fine per se, but that was > when semaphores were fast. hm, do we know it for a fact that the 40% AIM regression is due to the fastpath overhead of the BKL? It would be extraordinary if so. I think it is far more likely that it's due to the different scheduling and wakeup behavior of the new kernel/semaphore.c code. So the fix would be to restore the old scheduling behavior - that's what Yanmin's manual revert did and that's what got him back the previous AIM7 performance. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Matthew Wilcox on 7 May 2008 12:40 On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote: > up() seems to be doing wake-one, FIFO which is nice. Did the > implementation which we just removed also do that? Was it perhaps > accidentally doing LIFO or something like that? If heavily contended, it could do this. up() would increment sem->count and cal __up() which would call wake_up() down() would decrement sem->count. The unlucky task woken by __up would lose the race and go back to sleep. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 7 May 2008 13:00 On Wed, 7 May 2008, Matthew Wilcox wrote: > > If heavily contended, it could do this. It doesn' have to be heavily contended - if it's just hot and a bit lucky, it would potentially never schedule at all, because it would never take the spinlock and serialize the callers. It doesn't even need "unfairness" to work that way. The old semaphore implementation was very much designed to be lock-free, and if you had one CPU doing a lock while another did an unlock, the *common* situation was that the unlock would succeed first, because the unlocker was also the person who had the spinlock exclusively in its cache! The above may count as "lucky", but the hot-cache-line thing is a big deal. It likely "lucky" into something that isn't a 50:50 chance, but something that is quite possible to trigger consistently if you just have mostly short holders of the lock. Which, btw, is probably true. The BKL is normally held for short times, and released (by that thread) for relatively much longer times. Which is when spinlocks tend to work the best, even when they are fair (because it's not so much a fairness issue, it's simply a cost-of-taking-the-lock issue!) Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 7 May 2008 13:10
On Wed, 7 May 2008, Linus Torvalds wrote: > > Which, btw, is probably true. The BKL is normally held for short times, > and released (by that thread) for relatively much longer times. Which > is when spinlocks tend to work the best, even when they are fair (because > it's not so much a fairness issue, it's simply a cost-of-taking-the-lock > issue!) ... and don't get me wrong: the old semaphores (and the new mutexes) should also have this property when lucky: taking the lock is often a hot-path case. And the spinlock+generic semaphore thing probably makes that "lucky" behavior be exponentially less likely, because now to hit the lucky case, rather than the hot path having just *one* access to the interesting cache line, it has basically something like 4 accesses (spinlock, count test, count decrement, spinunlock), in addition to various serializing instructions, so I suspect it quite often gets serialized simply because even the "fast path" is actually about ten times as long! As a result, a slow "fast path" means that the thing gets saturated much more easily, and that in turn means that the "fast path" turns into a "slow path" more easily, which is how you end up in the scheduler rather than just taking the fast path. This is why sleeping locks are more expensive in general: they have a *huge* cost from when they get contended. Hundreds of times higher than a spinlock. And the faster they are, the longer it takes for them to get contended under load. So slowing them down in the fast path is a double whammy, in that it shows their bad behaviour much earlier. And the generic semaphores really are slower than the old optimized ones in that fast path. By a *big* amount. Which is why I'm 100% convinced it's not even worth saving the old code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with "slow path" other than the fact that it gets to that slow path much more these days. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |