From: Rafael J. Wysocki on
On Saturday 12 December 2009, Linus Torvalds wrote:
>
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> >
> > I'd like to put it into my tree in this form, if you don't mind.
>
> This version still has a major problem, which is not related to
> completions vs rwsems, but simply to the fact that you wanted to do this
> at the generic device layer level rather than do it at the actual
> low-level suspend/resume level.
>
> Namely that there's no apparent sane way to say "don't wait for children".

There is, if the parent really does something that could disturb the
children. That isn't always the case, but it is in at least a few important
cases (think of a USB controller and the USB devices behind it, for example).

I thought we had this discussion already, but perhaps that was with someone
else and in a slightly different context.

The main reasons why I think it's useful to do this at the generic device layer
level are that, if we do it this way:

a. Drivers that don't want to be "asynchronous" don't need to care in any case.

b. Drivers whose suspend and resume routines are guaranteed not to disturb
anyone else can mark their devices as "async" and be done with it, no other
modification of the code is needed (drivers that do nothing in their suspend
and resume routines also fall into this category).

Now, if it's done at the low-level suspend/resume level, a. will no longer be
true in general. Say device A has parent B and the driver of A wants to
suspend asynchronously. It needs to split its suspend into a synchronous and
an asynchronous part and at some point start an async thread to run the
latter. Now assume B has a real reason not to suspend before the suspend of A
has finished. Then the driver of B has to be modified so that it waits for
A's async suspend to complete (some sort of synchronization between the two
has to be added). So even though B is "synchronous", its driver has to be
modified to handle the asynchronous suspend of A.

Similarly, b. will no longer be true if it's done at the low-level
suspend/resume level, because then every driver that wants to be
"asynchronous" will need to take care of running an async thread and so on.
Moreover, it will need to make sure that the parent's driver doesn't
need to be modified, because the parent's suspend may do something that
disturbs the child's asynchronous suspend. And even if the parent's driver
doesn't need to be modified, it will need to consider the parent of the parent,
because that one may potentially disturb the asynchronous suspend of its
grandchild, and so on, up to a device without a parent.

That already is a pain for a driver writer, but the problem you say would
be solved by doing this at the low-level suspend/resume level is still there
in general! Namely, go back to the example with devices A and B and say B
_really_ has to wait for A's suspend to complete. Then, since B is after A in
dpm_list, the PM core will not start the suspend of any device after B until
the suspend of B returns. Now, if the suspend of B waits for the suspend of
A, the PM core will effectively wait for the suspend of A to complete
before suspending any other devices. Worse yet, if that happens, we can't do
anything about it at the low-level suspend/resume level, although at the PM
core level we can.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Monday 14 December 2009, Linus Torvalds wrote:
>
> On Mon, 14 Dec 2009, Rafael J. Wysocki wrote:
> >
> > OK, what about a two-pass approach in which the first pass only inits the
> > completions and starts async threads for leaf "async" devices? I think leaf
> > devices are most likely to take much time to suspend, so this will give us
> > a chance to save quite some time.
>
> Why?
>
> Really.

Because the PCI bridges are not the only case where it matters (I'd say they
are really a corner case). Basically, any two async devices separated by a
series of sync ones are unlikely to be suspended (or resumed) in parallel
with each other, because the parent is usually next to its children in dpm_list.
So, if the first device suspends, its "synchronous" parent waits for it and the
suspend of the second async device won't be started until the first one's
suspend has returned. And it doesn't matter at what level we do the async
thing, because dpm_list is there anyway.

As Alan said, the real problem is that we generally can't change the ordering
of dpm_list arbitrarily, because we don't know what's going to happen as a
result. The async_suspend flag tells us, basically, what devices can be safely
moved to different positions in dpm_list without breaking things, as long as
they are not moved behind their parents or in front of their children.

Starting the async suspends upfront would effectively work in the same way as
moving those devices to the beginning of dpm_list without breaking the
parent-child chains, which in turn is likely to allow us to save some extra
time.

That's not only about the PCI bridges, it's more general. As far as your
one-liner is concerned, I'm going to test it, because I think we could use it
anyway.

Rafael
From: Rafael J. Wysocki on
On Tuesday 15 December 2009, Linus Torvalds wrote:
>
> On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Because the PCI bridges are not the only case where it matters (I'd say they
> > are really a corner case). Basically, any two async devices separated by a
> > series of sync ones are unlikely to be suspended (or resumed) in parallel
> > with each other, because the parent is usually next to its children in dpm_list.
>
> Give a real example that matters.

I'll try. Let -> denote child-parent relationships and assume dpm_list looks
like this:

...., A->B->C, D, E->F->G, ...

where A, B, E, F are all async and C, D, G are sync (E, F, G may be USB and
A, B, C may be serio input devices and D is a device that just happens to be in
dpm_list between them). Say A and C take the majority of the total suspend
time and assume we traverse the dpm_list from left to right.

Now, during suspend, C waits for B that waits for A and G waits for F that
waits for E. Moreover, since C is sync, the PM core won't start the suspend
of D until the suspend of C has returned. In turn, since D is sync, the
suspend of E won't be started until the suspend of D has returned. So in
this situation the gain from the async suspends of A, B, E, F is zero.

However, it won't be zero if we start the async suspends of A, B, E, F
upfront.

I'm not sure if this is sufficiently "real life" for you, but this is how
dpm_list looks on one of my test boxes, more or less.

> Really.
>
> How hard can it be to understand: KISS. Keep It Simple, Stupid.
>
> I get really tired of this whole stupid async discussion, because you're
> overdesigning it.
>
> To a first approximation, THE ONLY THING THAT MATTERS IS USB.

If this applies to _resume_ only, then I agree, but Arjan's data clearly
show that serio devices take much more time to suspend than USB ones do.

But if we only talk about resume, the PCI bridges don't really matter,
because they are resumed before all devices that depend on them, so they don't
really need to wait for anyone anyway.

Rafael
From: Rafael J. Wysocki on
On Tuesday 15 December 2009, Linus Torvalds wrote:
>
> On Mon, 14 Dec 2009, Linus Torvalds wrote:
> >
> > I get really tired of this whole stupid async discussion, because you're
> > overdesigning it.
>
> Btw, this is important. I'm not going to pull even your _current_ async
> stuff if you can't show that you fundamentally UNDERSTAND this fact.

What fact? The only thing that matters is USB? For resume, it is. For
suspend, it clearly isn't.

> Stop making up idiotic complex interfaces. Look at my one-liner patch, and
> realize that it gets you 99% there - the 99% that matters.

I said I was going to use it, but I don't think that's going to be sufficient.

[BTW, I'm not sure what you want to achieve by insulting me. Either you want
to scare me, but I'm not scared, or you want to make me so disgusted that I'll
just give up and back off, but that's not going to happen either.]

Insults aside, I'm going to make some measurements to see how much time we can
save.

Rafael
From: Rafael J. Wysocki on
On Tuesday 15 December 2009, Alan Stern wrote:
> On Tue, 15 Dec 2009, Linus Torvalds wrote:
>
> > It's a very subtle theory, and it's not necessarily always 100% true. For
> > example, a cardbus bridge is strictly speaking very much a PCI bridge, but
> > for cardbus bridges we _do_ have a suspend/resume function.
> >
> > And perhaps worse than that, cardbus bridges are one of the canonical
> > examples where two different PCI devices actually share registers. It's
> > quite common that some of the control registers are shared across the two
> > subfunctions of a two-slot cardbus controller (and we generally don't even
> > have full docs for them!)
>
> Okay. This obviously implies that if/when cardbus bridges are
> converted to async suspend/resume, the driver should make sure that the
> lower-numbered devices wait for their sibling higher-numbered devices
> to suspend (and vice versa for resume). Awkward though it may be.
>
> > > The same goes for devices that don't have suspend or resume methods.
> >
> > Yes and no.
> >
> > Again, the "async_suspend" flag is done at the generic device layer, but
> > 99% of all suspend/resume methods are _not_ done at that level: they are
> > bus-specific functions, where the bus has a generic suspend-resume
> > function that it exposes to the generic device layer, and that knows about
> > the bus-specific rules.
> >
> > So if you are a PCI device (to take just that example - but it's true of
> > just about all other buses too), and you don't have any suspend or resume
> > methods, it's actually impossible to see that fact from the generic device
> > layer.
>
> Sure. That's why the async_suspend flag is set at the bus/driver
> level.
>
> > And even when you know it's PCI, our rules are actually not simple at all.
> > Our rules for PCI devices (and this strictly speaking is true for bridges
> > too) are rather complex:
> >
> > - do we have _any_ legacy PM support (ie the "direct" driver
> > suspend/resume functions in the driver ops, rather than having a
> > "struct dev_pm_ops" pointer)? If so, call "->suspend()"
> >
> > - If not - do we have that "dev_pm_ops" thing? If so, call it.
> >
> > - If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
> >
> > Notice? The way things are set up, if you have no suspend routine, you'll
> > not get suspended, but you will get disabled.
> >
> > So it's _not_ actually safe to asynchronously suspend a PCI device if that
> > device has no driver or no suspend routines - because even in the absence
> > of a driver and suspend routines, we'll still at least disable it. And if
> > there is some subtle dependency on that device that isn't obvious (say, it
> > might be used indirectly for some ACPI thing), then that async suspend is
> > the wrong thing to do.
> >
> > Subtle? Hell yes.
>
> I don't disagree. However the subtlety lies mainly in the matter of
> non-obvious dependencies. (The other stuff is all known to the PCI
> core.) AFAICS there's otherwise little difference between an async
> routine that does nothing and one that disables the device -- both
> operations are very fast.
>
> The ACPI relations are definitely something to worry about. It would
> be a good idea, at an early stage, to add those dependencies
> explicitly. I don't know enough about them to say more; perhaps Rafael
> does.

It boils down to the fact that for each PCI device known to the ACPI BIOS
there is a "shadow" ACPI device that generally has its own suspend/resume
callbacks and these "shadow" devices are members of the ACPI subtree
of the device tree (i.e., they have parents and so on).

Now, when I worked on the first version of async suspend/resume, I noticed
that if those "shadow" ACPI devices did not wait for their PCI counterparts to
suspend, things broke badly. The reason probably wasn't related to what they
did in their suspend/resume callbacks, because those are usually empty;
rather, it was related to the dependencies between devices in the ACPI subtree
(so, generally speaking, it seems the entire ACPI subtree of the device tree
should be suspended after the entire PCI subtree).

That obviously requires more investigation, though.

> As for other non-obvious dependencies... Who knows? Probably the only
> way to find them is by experimentation. My guess is that they will
> turn out to be connected mostly with "high-level" devices: system
> devices, things on the motherboard -- generally speaking, stuff close
> to the CPU. Relatively few will be associated with devices below the
> level of a PCI device or equivalent.
>
> Ideally we would figure out how to do the slow devices in parallel
> without interference from fast devices having unknown dependencies.
> Unfortunately this may not be possible.

I really expect to see those "unknown dependencies" in the _noirq
suspend/resume phases and above. [The very fact that they exist is worrisome,
because it's the reason we don't know why things work on one system and don't
work on another, although the two appear to be very similar.]

> > So the whole thing about "we can do PCI bridges asynchronously because
> > they are obviously no-op" is kind of true - except for the "obviously"
> > part. It's not obvious at all. It's rather subtle.
> >
> > As an example of this kind of subtlety - iirc PCIE bridges used to have
> > suspend and resume bugs when we initially switched over to the "new world"
> > suspend/resume exactly because they actually did things at "suspend" time
> > (rather than suspend_late), and that broke devices behind them (this was
> > not related to async, of course, but the point is that even when you look
> > like a PCI bridge, you might be doing odd things).

Well, those "pcieport devices" still are the children of PCIe ports, although
physically they just correspond to different sets of registers within the
ports' config spaces (_that_ is overdesigned IMnsHO) and they are "suspended"
during the regular suspend of their PCIe port "parents".

> > So just saying "let's do it asynchronously" is _not_ always guaranteed to
> > be the right thing at all. It's _probably_ safe for at least regular PCI
> > bridges. Cardbus bridges? Probably not, but since most modern laptops have
> > just a single slot - and people who have multiple slots seldom use them
> > all - most people will probably never see the problems that it _could_
> > introduce.
> >
> > And PCIE bridges? Should be safe these days, but it wasn't quite as
> > obvious, because a PCIE bridge actually has a driver unlike a regular
> > plain PCI-PCI bridge.
> >
> > Subtle, subtle.
>
> Indeed. Perhaps you were too hasty in suggesting that PCI bridges
> should be async.
>
> It would help a lot to see some device lists for typical machines. (If
> there are such things.) Otherwise we are just blowing gas.
>
> > > There remains a separate question: Should async devices also be forced
> > > to wait for their children? I don't see why not. For PCI bridges it
> > > won't make any significant difference. As long as the async code
> > > doesn't have to do anything, who cares when it runs?
> >
> > That's why I just set the "async_resume = 1" thing.
> >
> > But there might actually be reasons why we care. Like the fact that we
> > actually throttle the amount of parallel work we do in async_schedule().
> > So doing even a "no-op" asynchronously isn't actually a no-op: while it is
> > pending (and those things can be pending for a long time, since they have
> > to wait for those slow devices underneath them), it can cause _other_
> > async work - that isn't necessarily a no-op at all - to be then done
> > synchronously.
> >
> > Now, admittedly our async throttling limits are high enough that the above
> > kind of detail will probably never ever really matter (default 256 worker
> > threads etc). But it's an example of how practice is different from theory
> > - in _theory_ it doesn't make any difference if you wait for something
> > asynchronously, but in practice it could make a difference under some
> > circumstances.
>
> We certainly shouldn't be worried about side effects of async
> throttling as this stage. KISS works both ways: Don't overdesign, and
> don't worry about things that might crop up when you expand the design.
>
> We have strayed off the point of your original objection: not providing
> a way for devices to skip waiting for their children. This really is a
> separate issue from deciding whether or not to go async. For example,
> your proposed patch makes PCI bridges async but doesn't allow them to
> avoid waiting for children. IMO that's a good thing.
>
> The real issue is "blockage": synchronous devices preventing
> possible concurrency among async devices. That's what you thought
> making PCI bridges async would help.
>
> In general, blockage arises in suspend when you have an async child
> with a synchronous parent. The parent has to wait for the child, which
> might take a long time, thereby delaying other unrelated devices.

Exactly, but Linus' point seems to be that this is going to be rare and that
we should be able to special-case all of the interesting cases.

> (This explains why you wanted to make PCI bridges async -- they are the
> parents of USB controllers.) For resume it's the opposite: an async
> parent with synchronous children.

Is that really going to happen in practice? I mean, what would be the point?

> Thus, while making PCI bridges async might make suspend faster, it probably
> won't help much with resume speed. You'd have to make the children of USB
> devices (SCSI hosts, TTYs, and so on) async. Depending on the order of
> device registration, of course.
>
> Apart from all this, there's a glaring hole in the discussion so far.
> You and Arjan may not have noticed it, but those of us still using
> rotating media have to put up with disk resume times that are a factor
> of 100 (!) larger than USB resume times. That's where the greatest
> gains are to be found.

I guess so.

Rafael