From: Alan Stern on
On Tue, 8 Dec 2009, Linus Torvalds wrote:

> Side note: if this was a real lock, you'd also need an smp_wmb() in the
> 'wait_lock()' path after the atomic_inc(), to make sure that the locked
> state is visible to other people before the suspend starts.
>
> In your usage scenario, I don't think it would ever be noticeable, since
> the other users are always going to start running from the same thread
> that did the wait_lock(), so even if they run on other CPU's, we'll have
> scheduled _to_ those other CPU's and done enough memory ordering to
> guarantee that they will see the thing.
>
> So it would be ok in this situation, simply because it acts as an
> initializer and never sees any real SMP issues.

Yes. I would have brought this up, but you made the point for me.
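
For concreteness, a minimal sketch of the "wait-lock" pattern under
discussion, with the barrier Linus describes (the names and structure are
illustrative only, not an existing kernel API):

    /* Hypothetical counting "wait-lock". */
    struct wait_lock {
        atomic_t count;
        wait_queue_head_t wq;
    };

    static void wait_lock(struct wait_lock *w)
    {
        atomic_inc(&w->count);
        /*
         * Make the increment visible to other CPUs before any of
         * the work that follows.  As noted above, this only matters
         * if real SMP users can race with the locker; in the usage
         * here it effectively acts as an initializer.
         */
        smp_wmb();
    }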

> But it's an example of how you now don't just depend on the locking
> primitives themselves doing the right thing, you end up depending very
> subtly on exactly how the lock is used. The standard locks do have the
> same kind of issue for initializers, but we avoid it elsewhere because
> it's so risky.

No doubt there are other reasons why the "wait-lock" pattern doesn't
get used enough to be noticed.

Alan Stern

From: Alan Stern on
On Tue, 8 Dec 2009, Linus Torvalds wrote:

> > The whole readers vs. writers thing is a non-sequitur.
>
> No it's not.
>
> It's a 100% equivalent problem. It's purely a change of wording. The end
> result is the same.

Well, of course the end result is the same (ignoring bugs) -- that was
the point. It doesn't follow that the two locking mechanisms are 100%
equivalent.

> And note how even though you sprinkled random memory barriers around, you
> still got it wrong.

Yes. That comes of trying to think at the keyboard.

> It's certainly not smaller. It's not faster. It doesn't have support for
> lockdep. And it's BUGGY.

Lockdep will choke on the rwsem approach anyway. It has never been
very good at handling tree-structured locking, especially when there
are non-parent-child interactions. But never mind.

> Really. Tell me why you want to re-implement an existing lock - badly.

I didn't want to. The whole exercise was intended to make a point --
that rwsems do more than we really need here.

> [ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
> that anybody else who sees the new value will be guaranteed to have seen
> anything else the unlocker did.

Yes.

> You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> allow writes to migrate up either. 'atomic_read()' does not imply any
> barriers.

No, that's not needed. Unlike reads, writes can't move in front of
data or control dependencies. Or so I've been led to believe...
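
To make the disputed barriers concrete, the other two operations of that
sketch might look like this (names still illustrative; the smp_mb-vs-smp_rmb
question is taken up again below):

    static void wait_unlock(struct wait_lock *w)
    {
        /*
         * Full barrier before the decrement: anybody who sees the
         * new count must also see everything the unlocker did
         * before dropping the lock.
         */
        smp_mb();
        if (atomic_dec_and_test(&w->count))
            wake_up_all(&w->wq);
    }

    static void wait_for_lock(struct wait_lock *w)
    {
        wait_event(w->wq, atomic_read(&w->count) == 0);
        /*
         * Whether this can be smp_rmb() or must be a full smp_mb()
         * is exactly the point in dispute; atomic_read() itself
         * implies no barrier at all.
         */
        smp_mb();
    }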

> That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.

Not really. It also corresponds to a 'write_lock()+write_unlock()' (in
the suspend routine). Are you claiming these two compound operations
are equivalent?

> We
> _could_ expose such a mechanism for rwsem's too, but why do it? It's
> actually nicer to use a real read-lock - and do it _around_ the operation,
> because now the locking also automatically gets things like overlapping
> suspends and resumes right.
>
> (Which you'd obviously hope never happens, but it's nice from a conceptual
> standpoint to know that the locking is robust).

> Take heed. You got it wrong. Admit it. Locking is _hard_. SMP memory
> ordering is HARD.

Oh, there's no question about that. I never seriously intended this
stuff to be adopted. It was just for discussion.

Alan Stern

From: Alan Stern on
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:

> Suppose we use rwsems and during suspend each child does down_read() on its
> parent and then the parent does down_write() on itself. What if, for whatever
> reason, the parent is a bit early and does the down_write() before one of the
> children has a chance to do the down_read()? Aren't we toast?
>
> Do we need any direct protection against that or does it just work itself out
> in a way I just don't see right now?

That's not the way it should be done. Linus had children taking their
parents' locks during suspend, which is simple but leads to
difficulties.

Instead, the PM core should do a down_write() on each device before
starting the device's async suspend routine, and an up_write() when the
routine finishes. Parents should, at the start of their async routine,
do down_read() on each of their children plus whatever other devices
they need to wait for. The core can do the waiting-for-children part, and
the driver's suspend routine can handle any other waiting.

This is a little more awkward because it requires the parent to iterate
through its children. But it does solve the off-tree dependency
problem for suspends.
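
A sketch of that scheme, assuming one rwsem per device (the pm_rwsem field
and the child iteration are illustrative, not the real driver core data
structures):

    /* PM core, per device, before starting its async suspend: */
    down_write(&dev->pm_rwsem);
    async_schedule(async_suspend, dev);

    /* The async routine releases the write lock when it finishes. */
    static void async_suspend(void *data, async_cookie_t cookie)
    {
        struct device *dev = data;
        struct device *child;

        /* Wait for every child to finish suspending first. */
        list_for_each_entry(child, &dev->children, sibling) {
            down_read(&child->pm_rwsem);
            up_read(&child->pm_rwsem);
        }

        /* ... wait the same way for any off-tree devices ... */

        /* ... do the device's actual suspend work here ... */

        up_write(&dev->pm_rwsem);   /* lets waiting parents proceed */
    }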

Alan Stern

From: Rafael J. Wysocki on
On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > Suppose we use rwsems and during suspend each child does down_read() on its
> > parent and then the parent does down_write() on itself. What if, for whatever
> > reason, the parent is a bit early and does the down_write() before one of the
> > children has a chance to do the down_read()? Aren't we toast?
> >
> > Do we need any direct protection against that or does it just work itself out
> > in a way I just don't see right now?
>
> That's not the way it should be done. Linus had children taking their
> parents' locks during suspend, which is simple but leads to
> difficulties.
>
> Instead, the PM core should do a down_write() on each device before
> starting the device's async suspend routine, and an up_write() when the
> routine finishes. Parents should, at the start of their async routine,
> do down_read() on each of their children plus whatever other devices
> they need to wait for. The core can do the waiting-for-children part, and
> the driver's suspend routine can handle any other waiting.
>
> This is a little more awkward because it requires the parent to iterate
> through its children.

I can live with that.

> But it does solve the off-tree dependency problem for suspends.

That's a plus, but I still think we're trying to create a barrier-like
mechanism out of locks.

There's one more possibility to consider, though. What if we use a completion
instead of the flag + wait queue? It's certainly a standard synchronization
mechanism, and it seems like it might work here.
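
A sketch of that variant, assuming one struct completion per device
(pm_done is an illustrative field name):

    /* PM core, before the suspend pass begins: */
    init_completion(&dev->pm_done);

    /* When a device's suspend routine has finished: */
    complete_all(&dev->pm_done);

    /* A parent (or any off-tree dependent) waiting for a child: */
    wait_for_completion(&child->pm_done);

complete_all() releases any number of waiters, which is what makes it
behave like the one-shot barrier being described.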

Rafael
From: Linus Torvalds on


On Tue, 8 Dec 2009, Alan Stern wrote:
>
> > You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> > allow writes to migrate up either. 'atomic_read()' does not imply any
> > barriers.
>
> No, that's not needed. Unlike reads, writes can't move in front of
> data or control dependencies. Or so I've been led to believe...

Sure they can. Control dependencies are trivial - it's called "branch
prediction", and everybody does it, and data dependencies don't exist on
many CPU architectures (even to the point of reading through a pointer
that you loaded).

But yes, on x86, stores only move down. But that's an x86-specific thing.

[ Not that it isn't also very common - write buffering is easy and matters
for performance, so any in-order implementation will generally do it. In
contrast, letting writes move up doesn't really help performance and is
harder to do, but it can happen with a weakly ordered memory subsystem,
especially if you have multi-way caches where some ways are busy and end
up being congested.

So the _common_ case is definitely about delaying writes and doing reads
early if possible. But it's not necessarily at all guaranteed in
general. ]

> > That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.
>
> Not really. It also corresponds to a 'write_lock()+write_unlock()' (in
> the suspend routine). Are you claiming these two compound operations
> are equivalent?

They have separate semantics, and you just want to pick the one that suits
you. Your counting lock doesn't have the "read_lock+read_unlock" version;
it only has the write_lock/unlock one (i.e. it requires the thing to be
totally unlocked).

The point being, rwsems can do everything your counting lock does. And
they already exist. And they already know about all the subtleties of
architecture-specific memory ordering etc.
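
In rwsem terms, the two compound operations might be sketched like this
(pm_rwsem again being an illustrative name):

    /* Wait until there is no exclusive (write) holder: */
    down_read(&dev->pm_rwsem);
    up_read(&dev->pm_rwsem);

    /*
     * Wait until the lock is totally unlocked - the only form the
     * counting lock's wait_for_lock() actually provides:
     */
    down_write(&dev->pm_rwsem);
    up_write(&dev->pm_rwsem);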

Linus