Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33) [Kernel]

Prev: futex: Take mmap_sem for get_user_pages in fault_in_user_writeable
Next: [PATCH v9 0/8] Loongson: YeeLoong: add platform drivers

From: Linus Torvalds on 8 Dec 2009 17:20

On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > Sure they can. Control dependencies are trivial - it's called "branch
> > prediction", and everybody does it, and data dependencies don't exist on
> > many CPU architectures (even to the point of reading through a pointer
> > that you loaded).
>
> Wait a second. Are you saying that with code like this:
>
> if (x == 1)
> y = 5;
>
> the CPU may write to y before it has finished reading the value of x?

Well, in a way. The branch may have been predicted, and the CPU can
_internally_ have done the 'y=5' thing into a write buffer before it even
did the read.

Some time later it will have to _verify_ the prediction and then perhaps
kill the write before it makes it to a data structure that is visible to
others, but internally from the CPU standpoint, yes, the write could have
happened before the read.

Now, whether that write is "before" or "after" the read is debatable. But
one way of looking at it is certainly that the write took place earlier,
and the read might have just caused it to be undone.

And there are real effects of this - looking at the bus, you might have a
bus transaction to get the cacheline that contains 'y' for exclusive
access happen _before_ the bus transaction that reads in the value of 'x'
(but you'd never see the writeout of that '5' before).

> And this write is visible to other CPUs, so that if x was initially 0
> and a second CPU sets x to 1, the second CPU may see y == 5 before it
> executes the write to x (whatever that may mean)?

Well, yes and no. CPU1 above won't release the '5' until it has confirmed
the '1' (even if it does so by reading it late). but assuming the other
CPU also does speculation, then yes, the situation you describe could
happen. If the other CPU does

z = y;
x = 1;

then it's certainly possible that 'z' contains 5 at the end (even if both
x and y started out zero). Because now the read of 'y' on that other CPU
might be delayed, and the write of 'x' goes ahead, CPU1 sees the 1, and
commits its write of 5, sp when CPU2 gets the cacheline, z will now
contain 5.

Is it likely? No. CPU microarchitectures aim to do reads early, and writes
late. Reads are on the critical path, writes can be buffered. But you can
basically get into "impossible" situations where a write that was _later_
in the instruction stream than a read (on CPU2, the 'store 1 to x' would
be after the load of 'y' from memory) could show up in the other order on
another CPU.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 8 Dec 2009 17:30

On Tue, 8 Dec 2009, Alan Stern wrote:
>
> And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> these routines were not intended to be called with interrupts disabled,
> but that requirement doesn't seem to be documented. And it isn't a
> natural requirement anyway.

'complete()' is supposed to be callable from interrupts, but the waiting
ones aren't. But 'complete()' is all you should need to call from
interrupts, so that's fine.

So I think completions should work, if done right. That whole "make the
parent wait for all the children to complete" is fine in that sense. And
I'll happily take such an approach if my rwlock thing doesn't work.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 8 Dec 2009 17:40

On Tue, 8 Dec 2009, Alan Stern wrote:
>
> Suppose A and B are unrelated devices and we need to impose the
> off-tree constraint that A suspends after B.

Ah. Ok, I can imagine the off-tree constraints, but part of my "keep it
simple" was to simply not do them. If there are constraints that aren't
in the topology of the tree, then I simply don't think that async is worth
it in the first place.

> You misunderstand. The suspend algorithm will look like this:
>
> dpm_suspend()
> {
> list_for_each_entry_reverse(dpm_list, dev) {
> down_write(dev->lock);
> async_schedule(device_suspend, dev);
> }
> }
>
> device_suspend(dev)
> {
> device_for_each_child(dev, child) {
> down_read(child->lock);
> up_read(child->lock);
> }
> dev->suspend(dev); /* May do off-tree down+up pairs */
> up_write(dev->lock);
> }

Ok, so the above I think work (and see my previous email: I think
completions would be workable there too).

It's just that I think the "looping over children" is ugly, when I think
that by doing it the other way around you can make the code simpler and
only depend on the PM device list and a simple parent pointer access.

I also think that you are wrong that the above somehow protects against
non-topological dependencies. If the device you want to keep delay
yourself suspending for is after you in the list, the down_read() on that
may succeed simply because it hasn't even done its down_write() yet and
you got scheduled early.

But I guess you could do that by walking the list twice (first to lock
them all, then to actually call the suspend function). That whole
two-phase thing, except the first phase _only_ locks, and doesn't do any
callbacks.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rafael J. Wysocki on 8 Dec 2009 17:40

On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Linus Torvalds wrote:
>
> > On Tue, 8 Dec 2009, Alan Stern wrote:
> > >
> > > That's not the way it should be done. Linus had children taking their
> > > parents' locks during suspend, which is simple but leads to
> > > difficulties.
> >
> > No it doesn't. Name them.
>
> Well, one difficulty. It arises only because we are contemplating
> having the PM core fire up the async tasks, rather than having the
> drivers' suspend routines launch them (the way your original proposal
> did -- the difficulty does not arise there).
>
> Suppose A and B are unrelated devices and we need to impose the
> off-tree constraint that A suspends after B. With children taking
> their parent's lock, the way to prevent A from suspending too soon is
> by having B's suspend routine acquire A's lock.
>
> But B's suspend routine runs entirely in an async task, because that
> task is started by the PM core and it does the method call. Hence by
> the time B's suspend routine is called, A may already have begun
> suspending -- it's too late to take A's lock. To make the locking
> work, B would have to acquire A's lock _before_ B's async task starts.
> Since the PM core is unaware of the off-tree dependency, there's no
> simple way to make it work.

Do not set async_suspend for B and instead start your own async thread
from its suspend callback. The parent-children synchronization is done by the
core anyway (at least I'd do it that way), so the only thing you need to worry
about is the extra dependency.

> > That just complicates things. Compare to my simple locking scheme I've
> > quoted several times.
>
> It is a little more complicated in that it involves explicitly
> iterating over children. But it is simpler in that it can use
> completions instead of rwsems and it avoids the off-tree dependency
> problem described above.

I would be slightly more comfortable using completions, but the rwsem-based
approach is fine with me as well.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Alan Stern on 8 Dec 2009 21:20

On Tue, 8 Dec 2009, Linus Torvalds wrote:

> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > these routines were not intended to be called with interrupts disabled,
> > but that requirement doesn't seem to be documented. And it isn't a
> > natural requirement anyway.
>
> 'complete()' is supposed to be callable from interrupts, but the waiting
> ones aren't. But 'complete()' is all you should need to call from
> interrupts, so that's fine.

And try_wait_for_completion()? The fact that it doesn't block makes it
interrupt-safe. What's the point of having an interrupt-safe routine
that you can't call from within interrupt handlers?

Even if nobody uses it that way now, there's no guarantee somebody
won't attempt it in the future.

> So I think completions should work, if done right. That whole "make the
> parent wait for all the children to complete" is fine in that sense. And
> I'll happily take such an approach if my rwlock thing doesn't work.

In principle the two approaches could be combined: Add an rwsem for use
by children and a completion for off-tree[*] use. But that would
certainly be overkill. Looping over children doesn't take a
tremendous amount of time compared to a full system suspend.

Alan Stern

[*] "Off-tree" isn't really an appropriate term; these devices aren't
"off" the tree. "Non-tree" would be better.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10
Prev: futex: Take mmap_sem for get_user_pages in fault_in_user_writeable
Next: [PATCH v9 0/8] Loongson: YeeLoong: add platform drivers