From: Raistlin on
On Sat, 2010-07-10 at 09:09 +0200, Luca Abeni wrote:
> > - do you think it could be useful to have a different syscall to deal
> > with the period parameter (if it's different from deadline), e.g.,
> > something useful to make the task periodic as you have (if I remember
> > well) in Xenomai or RTAI?
> Maybe I am confused because I missed the initial part of the discussion,
> but here I think there is the risk to mix two different concepts: the
> "reservation period" (that is, the period P used to postpone the
> scheduling deadline when the budget arrives to 0), and the "task
> period" (which has to do with the periodicity of tasks activations).
>
Agreed, this is also what I fear...

> For
> implementing a periodic behaviour in the task (this is, AFAIK, what RTAI
> similar API provide), no new syscall is needed: clock_nanosleep() is
> enough. See http://www.disi.unitn.it/~abeni/RTOS/rtapi.pdf for a
> (stupid) example.
>
Yep, agree again, just wanted to see what other folks' thoughts
were. :-)
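
For reference, a minimal user-space sketch of that pattern (this is not the
example from the linked slides; the clock choice, the period and the job
body are placeholders):

#include <time.h>

#define NSEC_PER_SEC 1000000000L

extern void do_job(void);	/* placeholder for the task's periodic work */

static void timespec_add_ns(struct timespec *t, long ns)
{
	t->tv_nsec += ns;
	while (t->tv_nsec >= NSEC_PER_SEC) {
		t->tv_nsec -= NSEC_PER_SEC;
		t->tv_sec++;
	}
}

void periodic_loop(long period_ns)
{
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		do_job();
		timespec_add_ns(&next, period_ns);
		/* sleep until the absolute start of the next period */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}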

> > If you think it's worth doing that, do you think the
> > task_wait_interval() syscall that we already added could/should do
> > the job?
> I do not remember what task_wait_interval() does :)
> Is it the syscall you added to indicate the end of a job?
>
Yep. And it can be useful for that purpose, or not be used at all.

> I think if you want a different P_i and D_i you can use D_i for
> generating new scheduling deadlines on task arrivals as "d = t + D_i",
> and P_i to postpone the scheduling deadlines as "d = d + P_i" when the
> budget is 0.
>
Yes, that's exactly what we wanted to do, considering it's also a very
small and easy-to-achieve behavioural change...
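
In pseudo-C, that change amounts to something like this (types and names
are illustrative only, not the actual patch code):

#include <stdint.h>

struct dl_params {
	uint64_t runtime;	/* Q_i */
	uint64_t deadline;	/* D_i, relative deadline */
	uint64_t period;	/* P_i, reservation period */
};

/* On a new job arrival at time t: d = t + D_i, full budget. */
static void on_job_arrival(const struct dl_params *p, uint64_t t,
			   uint64_t *d, uint64_t *budget)
{
	*d = t + p->deadline;
	*budget = p->runtime;
}

/* When the budget reaches 0: d = d + P_i, replenish to Q_i. */
static void on_budget_exhausted(const struct dl_params *p,
				uint64_t *d, uint64_t *budget)
{
	*d += p->period;
	*budget = p->runtime;
}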

> Depending on the replenishment amount you use, you might need to modify
> the admission test as "Sum_i(Q_i/min{P_i,D_i}) < 1" or not (if you
> always replenish to Q_i, then you need a modified admission test;
> otherwise, you can compute the replenishment amount so that the
> admission test is unaffected).
>
Well, as I tried to point out in the other e-mails, there are also other
problems with the admission test... Let's see if we find a consensus on
how to proceed... :-)
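
For completeness, here is a sketch of the modified test quoted above,
assuming full replenishment to Q_i (floating point for clarity; an
in-kernel version would use fixed-point arithmetic):

#include <stdbool.h>

/* Admit the n tasks iff Sum_i(Q[i] / min(P[i], D[i])) < 1. */
static bool dl_admission_ok(const double *Q, const double *D,
			    const double *P, int n)
{
	double sum = 0.0;

	for (int i = 0; i < n; i++) {
		double den = (P[i] < D[i]) ? P[i] : D[i];	/* min{P_i, D_i} */
		sum += Q[i] / den;
	}
	return sum < 1.0;
}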

Thanks and regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / raistlin(a)ekiga.net /
dario.faggioli(a)jabber.org
From: Peter Zijlstra on
On Sat, 2010-07-10 at 11:01 +0200, Raistlin wrote:
> On Fri, 2010-07-09 at 18:35 +0200, Peter Zijlstra wrote:
> > I think the easiest path for now would indeed be to split between hard
> > and soft rt tasks, and limit hard to d==p, and later worry about
> > supporting d<p for hard.
> >
> Mmm... I see... Are you thinking of another scheduling class? Or maybe
> just another queue with "higher priority" inside the same scheduling
> class (sched_dl.c)?

Inside the same class, since, as you say, that would allow sharing lots of
things; conceptually it also makes sense, as the admission tests would
really have to share a lot of data between them anyway.

> Maybe having two policies inside the same class (maybe handled in
> separate queues/rb-trees) might save a lot of code duplication.
> If we want to go like this, suggestions on the name(s) of the new (or of
> both) policy(-ies) are more than welcome. :-D

Right, so that's a good point, I'm wondering if we should use two
separate policies or use the one policy, SCHED_DEADLINE, and use flags
to distinguish between these two uses.

Anyway, that part is the easy part to implement and shouldn't take more
than a few minutes to flip between one and the other.

> > The idea is that we approximate G-EDF by moving tasks around, but Dario
> > told me the admission tests are still assuming P-EDF.
> >
> Yep, as said above, that's what we've done up to now. Regarding
> "partitioned admission", let me try to explain this.
>
> You asked me to use sched_dl_runtime_us/sched_dl_period_us to let people
> decide how much bandwidth should be devoted to EDF tasks. This obviously
> yields _only_one_ bandwidth value, which is then used as the utilization
> cap on *each* CPU, mainly for consistency with
> sched_rt_{runtime,period}_us. At the time I was using that value as the
> "overall EDF bandwidth", but I changed to your suggested semantics.

But if you have a per-cpu bandwidth, and the number of cpus, you also
have the total amount of bandwidth available to G-EDF, no?

> With global scheduling in place, we have this new situation. A task is
> forked on a CPU (say 0), and I admit it if there's enough bandwidth
> for it on that processor (and, if so, I also consume that amount of
> bw). When the task is dynamically migrated to CPU 1 I have two
> choices:
> (a) I move the bandwidth it occupies from 0 to 1 or,
> (b) I leave it (the bw, not the task) where it is, on 0.

Well, typically G-EDF doesn't really care about what cpu runs what, as
long as the admission thing is respected and we maintain the
smp-invariant of running the n<=m highest 'prio' tasks on m cpus.

So it really doesn't matter how we specify the group budget, one global
clock or one clock per cpu, if we have the number of cpus involved we
can convert between those.

(c) use a 'global' bw pool and be blissfully ignorant of the
per-cpu things?

> If we want something better I cannot think on anything that doesn't
> include having a global (per-domain should be fine as well) mechanism
> for bandwidth accounting...

Right, per root_domain bandwidth accounting for admission should be
perfectly fine.
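
Something along these lines, just to fix ideas (structure and field names
are made up for illustration, not the actual code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-root_domain pool: total capacity is the per-cpu limit
 * multiplied by the number of CPUs in the domain; admission consumes from
 * it, and a task leaving the domain releases its share back to it.
 */
struct dl_bw_pool {
	uint64_t total_bw;	/* per-cpu limit * nr_cpus in the domain */
	uint64_t used_bw;	/* sum of admitted tasks' Q_i/P_i shares */
};

static bool dl_bw_request(struct dl_bw_pool *pool, uint64_t task_bw)
{
	if (pool->used_bw + task_bw > pool->total_bw)
		return false;		/* admission refused */
	pool->used_bw += task_bw;
	return true;
}

static void dl_bw_release(struct dl_bw_pool *pool, uint64_t task_bw)
{
	pool->used_bw -= task_bw;
}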

> > Add to that the interesting problems of task affinity and we might soon
> > all have a head-ache ;-)
> >
> Right now we support affinity, i.e., tasks will be pushed/pulled to/by
> CPUs where they can run. I'm not aware of any academic work that
> analyzes such a situation, but this doesn't mean we can't figure
> something out... Just to give people an example of "why real-time
> scheduling theory still matters"!! ;-P ;-P

Hehe, I wouldn't at all mind disallowing random affinity masks and only
dealing with 1 cpu or 'all' cpus for now.

But yeah, if someone can come up with something clever, I'm all ears ;-)

> > One thing we can do is limit the task affinity to either 1 cpu or all
> > cpus in the load-balance domain. Since there don't yet exist any
> > applications we can disallow things to make life easier.
> >
> > If we only allow pinned tasks and free tasks, splitting the admission
> > test in two would suffice, I think; if we keep one per-cpu utilization
> > measure and use the maximum of these over all cpus to start the global
> > utilization measure, things ought to work out.
> >
> Ok, that seems possible to me, but since I have to write the code you
> must tell me what you want the semantic of (syswide and per-group)
> sched_dl_{runtime,period} to become and how I should treat them! :-)

Right, so for the system-wide and group bandwidth limits I think we
should present them as if there's one clock, and let the scheduler sort
out how many cpus are available to make it happen.

So we specify bandwidth as if we were looking at our watch, and say,
this here group can consume 30 seconds every 2 minutes. If the
load-balance domains happen to be larger than 1 cpu, hooray we can run
more tasks and the total available bandwidth simply gets multiplied by
the number of available cpus.
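
As a concrete instance of that (the 4-cpu domain is just an example):

#include <stdio.h>

int main(void)
{
	const double dl_runtime_s = 30.0;	/* 30 seconds... */
	const double dl_period_s = 120.0;	/* ...every 2 minutes */
	const int nr_cpus = 4;			/* example domain size */

	/* 30/120 = 0.25 "of the clock" */
	printf("utilization per clock: %.2f\n", dl_runtime_s / dl_period_s);
	/* on 4 cpus the group may consume 30 s * 4 = 120 s of CPU time
	 * every 2 minutes */
	printf("cpu time per period: %.0f s\n", dl_runtime_s * nr_cpus);
	return 0;
}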

Makes sense?



From: Peter Zijlstra on
On Sat, 2010-07-10 at 09:11 +0200, Luca Abeni wrote:
> On Fri, 2010-07-09 at 16:24 +0200, Peter Zijlstra wrote:
> > On Fri, 2010-07-09 at 15:38 +0200, Raistlin wrote:
> > > Basically, from the scheduling point of view, what could happen is
> > > that I'm still _NOT_ going to allow a task with runtime Q_i, deadline
> > > D_i and period P_i to use more bandwidth than Q_i/P_i; I'm still using D
> > > for scheduling, but passing the simple in-kernel admission test
> > > Sum_i(Q_i/P_i)<1 won't guarantee that the task will always finish its
> > > jobs before D.
> >
> > But the tardiness would still be bounded, right? So its a valid Soft-RT
> > model?

> I think that if Sum_i(Q_i/P_i)<1 but Sum_i(Q_i/min{P_i,D_i})>=1 then you
> can have sporadic deadline misses, but it should still be possible to
> compute an upper bound for the tardiness.
> But this is just a feeling, I have no proof... :)

The paper referenced by Bjoern yesterday mentioned that Baruah et al.
did that proof.

"For the considered case di < pi , Baruah et al. [7] showed
that the complexity increases considerably. However, as-
suming the processor utilization to be strictly less than 1,
Baruah et al. [6, 7] proved that if a deadline is missed,
this happens within a maximum time upper bound which
can be computed."


[6] S. Baruah, A. Mok, and L. Rosier. Preemptively scheduling
hard-real-time sporadic tasks on one processor. Proceedings
of the 11th IEEE Real-Time Systems Symposium, December 1990.

[7] S. Baruah, L. Rosier, and R. Howell. Algorithms and complexity
concerning the preemptive scheduling of periodic real-time tasks on
one processor. Real-Time Systems, 2(4):301–324, November 1990.


It also jibes well with what I remember from reading through Jim's
papers on Soft-RT G-EDF.
From: Raistlin on
On Sat, 2010-07-10 at 12:28 +0200, Peter Zijlstra wrote:
> > Mmm... I see... Are you thinking of another scheduling class? Or maybe
> > just another queue with "higher priority" inside the same scheduling
> > class (sched_dl.c)?
>
> Inside the same class, since, as you say, that would allow sharing lots
> of things; conceptually it also makes sense, as the admission tests
> would really have to share a lot of data between them anyway.
>
Ok, fine. So my plan is to get what I have out as fast as I can, so
that you can see the progress we've made, and then start working on
this new thing...

> Right, so that's a good point, I'm wondering if we should use two
> separate policies or use the one policy, SCHED_DEADLINE, and use flags
> to distinguish between these two uses.
>
> Anyway, that part is the easy part to implement and shouldn't take more
> than a few minutes to flip between one and the other.
>
Yes, that will need very little effort to change.

> But if you have a per-cpu bandwidth, and the number of cpus, you also
> have the total amount of bandwidth available to G-EDF, no?
>
That's certainly true. If you like, I can add something like that quite
easily... I'll try to do it on a per-domain basis, rather than making it
truly global.

> So it really doesn't matter how we specify the group budget, one global
> clock or one clock per cpu, if we have the number of cpus involved we
> can convert between those.
>
> (c) use a 'global' bw pool and be blissfully ignorant of the
> per-cpu things?
>
Ok, if having a global pool is not a problem, I think I'll keep the per-CPU
bw specification as well as the per-CPU bw availability. Not only because I
like multiplications more than divisions, but also because I think they
could be useful if someone wants to achieve partitioned behaviour by
setting affinities accordingly.

> > Ok, that seems possible to me, but since I have to write the code you
> > must tell me what you want the semantic of (syswide and per-group)
> > sched_dl_{runtime,period} to become and how I should treat them! :-)
>
> Right, so for the system-wide and group bandwidth limits I think we
> should present them as if there's one clock, and let the scheduler sort
> out how many cpus are available to make it happen.
>
As said above, I agree and I'll do exactly that...

> So we specify bandwidth as if we were looking at our watch, and say,
> this here group can consume 30 seconds every 2 minutes. If the
> load-balance domains happen to be larger than 1 cpu, hooray we can run
> more tasks and the total available bandwidth simply gets multiplied by
> the number of available cpus.
>
> Makes sense?
>
Yep. Thanks! :-)

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / raistlin(a)ekiga.net /
dario.faggioli(a)jabber.org
From: Peter Zijlstra on
On Sat, 2010-07-10 at 09:50 +0200, Raistlin wrote:

> Hey, fine, where's the problem? :-P

We're talking about it... the exact semantics and the reasons for
them ;-)

> > What are the exact semantics of this extra proposed syscall?
> >
> Right now, it is:
> task_wait_interval(t) --> "wake me up at the first instant after t when
> you can give me my full runtime"
>
> > What exactly are the benefits over not having it, and simply relying on
> > the task not to wake up more often but, if it does, having it run into
> > the lack of budget and sorting it out that way?
> >
> What you're saying obviously will always work, and it is actually a
> quite common usage pattern (we use it like that a lot! :-)).
>
> The new syscall might help when it is important for a task to
> synchronize with the budget provisioning mechanism. It might be
> uncommon, but there could be situations --more in hard than in soft
> scenarios-- where you want to be sure that your next job (and all the
> subsequent ones, if you behave well) will get its full runtime, even if
> this means waiting a little bit.
>
> What I was wondering was whether this semantics should be modified by
> the introduction of the "period", but I also agree with Luca that we
> must do our best to avoid confusion!

Right, so I would actually expect RT job releases to be triggered by
external events (say, interrupts) rather than by the tasks themselves. And
when it's an external event I don't really see the use of this new syscall.

I guess I'm asking for what reason RT tasks would ever be
self-releasing; it seems odd...
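
Just to make the two usage patterns concrete (task_wait_interval() is not a
mainline interface; its prototype here is assumed for the sake of the
sketch, and the job body is a placeholder):

#include <time.h>

extern void do_job(void);				/* placeholder job body */
extern int task_wait_interval(struct timespec *t);	/* assumed prototype */

/* (a) Plain periodic release: sleep until the next absolute activation
 * and let the scheduler throttle the task if it overruns its budget. */
void job_then_sleep(struct timespec *next, long period_ns)
{
	do_job();
	next->tv_nsec += period_ns;	/* normalization omitted for brevity */
	clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, next, NULL);
}

/* (b) Synchronize with budget provisioning: wake at the first instant
 * after t at which a full runtime is available. */
void job_then_wait_interval(struct timespec *t)
{
	do_job();
	task_wait_interval(t);
}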