From: Ingo Molnar on

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

>
>
> On Wed, 7 May 2008, Ingo Molnar wrote:
> >
> > it was removed by me in the course of this discussion:
> >
> > http://lkml.org/lkml/2008/1/2/58
> >
> > the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
> > spinlock version] was broken for a longer period of time (it crashed
> > trivially), because nobody apparently used it.
>
> Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL
> was the known-flaky one.
>
> The thread you point to also says that it's PREEMPT_BKL=y that was the
> problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I
> was still using reiserfs."), not the plain spinlock approach.

no, there was another problem (which i couldnt immediately find because
lkml.org only indexes part of the threads, i'll research it some more),
which was some cond_resched() thing in the !PREEMPT_BKL case.

> But it would definitely be interesting to see the crash reports. And
> the help message always said "Say N if you are unsure." even if it
> ended up being marked 'y' by default at some point (and then in
> January was made first unconditional, and then removed entirely)
>
> Because in many ways, the non-preempt BKL is the *much* simpler case.
> I don't see why it would crash - it just turns the BKL into a trivial
> counting spinlock that can sleep.

yeah. The latencies are a different problem, and indeed were reported
against PREEMPT_BKL, and believed to be due to reiser3 and the tty code.
(reiser3 runs almost all of its code under the BKL)

The !PREEMPT_BKL crash was some simple screwup on my part of getting
atomicity checks wrong in cond_resched() - and it went unnoticed for a
long time - or something like that. I'll try to find that discussion.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Wed, 7 May 2008, Ingo Molnar wrote:
>
> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see a
> severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

Not agreed.

The BKL is special because it is a *single* lock.

All the "normal" mutex code use fine-grained locking, so even if you slow
down the fast path, that won't cause the same kind of fastpath->slowpath
increase.

In order to see the fastpath->slowpath thing, you do need to have many
threads hitting the same lock: ie the slowdown has to result in real
contention.

Almost no mutexes have any potential for contention what-so-ever, except
for things that very consciously try to hit it (multiple threads doing
readdir and/or file creation on the *same* directory etc).

The BKL really is special.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Wed, 7 May 2008, Linus Torvalds wrote:
>
> All the "normal" mutex code use fine-grained locking, so even if you slow
> down the fast path, that won't cause the same kind of fastpath->slowpath
> increase.

Put another way: let's say that the "good fastpath" is basically a single
locked instruction - ~12 cycles on AMD, ~35 Core 2. That's the
no-bouncing, no-contention case.

Doing it with debugging (call overhead, spinlocks, local irq saving rtc)
will probably easily triple it or more, but we're not changing anything
else. There's no "downstream" effect: the behaviour itself doesn't change.
It doesn't get more bouncing, it doesn't start sleeping.

But what happens if the lock has the *potential* for conflicts is
different.

There, a "longish pause + fast lock + short average code sequece + fast
unlock" is quite likely to stay uncontended for a fair amount of time, and
while it will be much slower than the no-contention-at-all case (because
you do get a pretty likely cacheline event at the "fast lock" part), with
a fairly low number of CPU's and a long enough pause, you *also* easily
get into a pattern where the thing that got the lock will likely also get
to unlock without dropping the cacheline.

So far so good.

But that basically depends on the fact that "lock + work + unlock" is
_much_ shorter than the "longish pause" in between, so that even if you
have <n> CPU's all doing the same thing, their pauses between the locked
section are still bigger than <n> times that short time.

Once that is no longer true, you now start to bounce both at the lock
*and* the unlock, and now that whole sequence got likely ten times slower.
*AND* because it now actually has real contention, it actually got even
worse: if the lock is a sleeping one, you get *another* order of magnitude
just because you now started doing scheduling overhead too!

So the thing is, it just breaks down very badly. A spinlock that gets
contention probably gets ten times slower due to bouncing the cacheline. A
semaphore that gets contention probably gets a *hundred* times slower, or
more.

And so my bet is that both the old and the new semaphores had the same bad
break-down situation, but the new semaphores just are a lot easier to
trigger it because they are at least three times costlier than the old
ones, so you just hit the bad behaviour with much lower loads (or fewer
number of CPU's).

But spinlocks really do behave much better when contended, because at
least they don't get the even bigger hit of also hitting the scheduler. So
the old semaphores would have behaved badly too *eventually*, they just
needed a more extreme case to show that bad behavior.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ingo Molnar on

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

> > There's far more normal mutex fastpath use during an AIM7 run than
> > any BKL use. So if it's due to any direct fastpath overhead and the
> > resulting widening of the window for the real slowdown, we should
> > see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
>
> Not agreed.
>
> The BKL is special because it is a *single* lock.

ok, indeed my suggestion is wrong and this would not be a good
comparison.

another idea: my trial-baloon patch should test your theory too, because
the generic down_trylock() is still the 'fat' version, it does:

spin_lock_irqsave(&sem->lock, flags);
count = sem->count - 1;
if (likely(count >= 0))
sem->count = count;
spin_unlock_irqrestore(&sem->lock, flags);

if there is a noticeable performance difference between your
trial-ballon patch and mine, then the micro-cost of the BKL very much
matters to this workload. Agreed about that?

but i'd be _hugely_ surprised about it. The tty code's BKL use should i
think only happen when a task exits and releases the tty - and a task
exit - even if this is a threaded test (which AIM7 can be - not sure
which exact parameters Yanmin used) - the costs of thread creation and
thread exit are just not in the same ballpark as any BKL micro-costs.
Dunno, maybe i overlooked some high-freq BKL user. (but any such site
would have shown up before) Even assuming a widening of the critical
path and some catastrophic domino effect (that does show up as increased
scheduling) i've never seen a 40% drop like this.

this regression, to me, has "different scheduling behavior" written all
over it - but that's just an impression. I'm not going to bet against
you though ;-)

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Wed, 7 May 2008, Ingo Molnar wrote:
>
> no, there was another problem (which i couldnt immediately find because
> lkml.org only indexes part of the threads, i'll research it some more),
> which was some cond_resched() thing in the !PREEMPT_BKL case.

Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe
way. It uses just

!(preempt_count() & PREEMPT_ACTIVE)

to see whether it can schedule, and it should probably use in_atomic()
which ignores the kernel lock.

But right now, that whole thing is disabled if PREEMPT is on anyway, so in
effect (with my test patch, at least) cond_preempt() would just be a no-op
if PREEMPT is on, even if BKL isn't preemptable.

So it doesn't look buggy, but it looks like it might cause longer
latencies than strictly necessary. And if somebody depends on
cond_resched() to avoid some bad livelock situation, that would obviously
not work (but that sounds like a fundamental bug anyway, I really hope
nobody has ever written their code that way).

> The !PREEMPT_BKL crash was some simple screwup on my part of getting
> atomicity checks wrong in cond_resched() - and it went unnoticed for a
> long time - or something like that. I'll try to find that discussion.

Yes, some silly bug sounds more likely. Especially considering how many
different cases there were (semaphores vs spinlocks vs preemptable
spinlocks).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/