From: Peter Zijlstra on
On Thu, 2010-05-20 at 16:48 -0400, Chris Mason wrote:
>
> This is more of a starting point than a patch, but it is something I've
> been meaning to look at for a long time. Many different workloads end
> up hammering very hard on try_to_wake_up, to the point where the
> runqueue locks dominate CPU profiles.

Right, so one of the things that I considered was to make p->state an
atomic_t and replace the initial stage of try_to_wake_up() with
something like:

int try_to_wake_up(struct task_struct *p, unsigned int mask, int wake_flags)
{
        int state = atomic_read(&p->state);

        for (;;) {
                int old;

                if (!(state & mask))
                        return 0;

                old = atomic_cmpxchg(&p->state, state, TASK_WAKING);
                if (old == state)
                        break;          /* we installed TASK_WAKING */
                state = old;            /* lost a race; re-check the mask */
        }

        /* do this pending queue + ipi thing */

        return 1;
}

Also, I think we might want to put that atomic single linked list thing
into some header (using atomic_long_t or so), because I have a similar
thing living in kernel/perf_event.c, that needs to queue things from NMI
context.
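
For reference, such a header could be as small as the sketch below. This is a userspace C11 rendering with illustrative names (the kernel version would presumably use atomic_long_t and cmpxchg()); push only retries a compare-exchange, which is why it is safe from NMI context, and the consumer detaches the whole list in one exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Lock-free singly linked list supporting only push and pop-all,
 * which is all a wakeup/NMI queue needs.  Entries come back in
 * LIFO order; walk (or reverse) as needed. */
struct llist_node {
        struct llist_node *next;
};

struct llist_head {
        _Atomic(struct llist_node *) first;
};

/* Safe from any context, including NMI: retry until the
 * compare-exchange installs our node as the new head. */
static void llist_push(struct llist_head *head, struct llist_node *node)
{
        struct llist_node *first = atomic_load(&head->first);

        do {
                node->next = first;
        } while (!atomic_compare_exchange_weak(&head->first, &first, node));
}

/* Detach the entire list in a single atomic exchange. */
static struct llist_node *llist_pop_all(struct llist_head *head)
{
        return atomic_exchange(&head->first, NULL);
}
```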

The advantage of doing basically the whole enqueue on the remote cpu is
less cacheline bouncing of the runqueue structures.
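
The "pending queue + ipi thing" might look roughly like the following sketch (names are illustrative, and the ipi_sent counter stands in for smp_send_reschedule()): the waker pushes the task onto the target runqueue's lock-free list and kicks that CPU only when the list was previously empty, so a burst of wakeups costs a single IPI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct wake_entry {
        struct wake_entry *next;
};

struct remote_rq {
        _Atomic(struct wake_entry *) wake_list;
        int ipi_sent;   /* stands in for smp_send_reschedule() */
};

static void queue_remote_wakeup(struct remote_rq *rq, struct wake_entry *we)
{
        struct wake_entry *head = atomic_load(&rq->wake_list);

        do {
                we->next = head;
        } while (!atomic_compare_exchange_weak(&rq->wake_list, &head, we));

        /* Only the enqueue that found an empty list sends the IPI;
         * the target drains the whole list when it handles it. */
        if (!head)
                rq->ipi_sent++;
}
```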

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Thu, 2010-05-20 at 23:09 +0200, Peter Zijlstra wrote:

> int try_to_wake_up(struct task_struct *p, unsigned int mask, int wake_flags)
> {
>         int state = atomic_read(&p->state);
>
>         for (;;) {
>                 int old;
>
>                 if (!(state & mask))
>                         return 0;
>
>                 old = atomic_cmpxchg(&p->state, state, TASK_WAKING);
>                 if (old == state)
>                         break;          /* we installed TASK_WAKING */
>                 state = old;            /* lost a race; re-check the mask */
>         }

cpu = select_task_rq()

and then somehow see we get set_task_cpu() done without races :-)

>         /* do this pending queue + ipi thing */
>
>         return 1;
> }


From: Chris Mason on
On Thu, May 20, 2010 at 11:23:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-20 at 23:09 +0200, Peter Zijlstra wrote:
>
> > int try_to_wake_up(struct task_struct *p, unsigned int mask, int wake_flags)
> > {
> >         int state = atomic_read(&p->state);
> >
> >         for (;;) {
> >                 int old;
> >
> >                 if (!(state & mask))
> >                         return 0;
> >
> >                 old = atomic_cmpxchg(&p->state, state, TASK_WAKING);
> >                 if (old == state)
> >                         break;          /* we installed TASK_WAKING */
> >                 state = old;            /* lost a race; re-check the mask */
> >         }
>
> cpu = select_task_rq()
>
> and then somehow see we get set_task_cpu() done without races :-)
>
> >         /* do this pending queue + ipi thing */
> >
> >         return 1;
> > }

I tried not to set the task to TASK_WAKING, since I was worried about
races with us getting queued somewhere else. But I don't have a good
handle on all of that, so I kind of chickened out. That's why my code
falls back to the full ttwu in a few cases.

Do you think the above could be an addition to my patch, or is it
required for my patch to work well?

-chris

From: Chris Mason on
On Thu, May 20, 2010 at 11:09:46PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-20 at 16:48 -0400, Chris Mason wrote:
> >
> > This is more of a starting point than a patch, but it is something I've
> > been meaning to look at for a long time. Many different workloads end
> > up hammering very hard on try_to_wake_up, to the point where the
> > runqueue locks dominate CPU profiles.
>
> Right, so one of the things that I considered was to make p->state an
> atomic_t and replace the initial stage of try_to_wake_up() with
> something like:
>
> int try_to_wake_up(struct task_struct *p, unsigned int mask, int wake_flags)
> {
>         int state = atomic_read(&p->state);
>
>         for (;;) {
>                 int old;
>
>                 if (!(state & mask))
>                         return 0;
>
>                 old = atomic_cmpxchg(&p->state, state, TASK_WAKING);
>                 if (old == state)
>                         break;          /* we installed TASK_WAKING */
>                 state = old;            /* lost a race; re-check the mask */
>         }
>
>         /* do this pending queue + ipi thing */
>
>         return 1;
> }
>
> Also, I think we might want to put that atomic single linked list thing
> into some header (using atomic_long_t or so), because I have a similar
> thing living in kernel/perf_event.c, that needs to queue things from NMI
> context.

So I've done three of these cmpxchg lists recently... but they have all
been a little different. I went back and forth a bunch of times about
using a list_head based thing instead to avoid the walk for list append.
I really don't like the walk.

But what makes this one unique is that I'm using a cmpxchg on the list
pointer in the task struct to take ownership of that task struct. That
is how I avoid concurrent lockless enqueues.
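
The ownership trick can be sketched like this (illustrative names, not the actual patch; C11 atomics stand in for the kernel's cmpxchg()): the per-task list pointer is NULL while the task is unqueued, and whichever waker moves it away from NULL owns the enqueue, so the same task can never be queued twice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* The per-task list pointer doubles as an ownership flag. */
struct task_stub {
        _Atomic(struct task_stub *) wake_next;  /* NULL = not queued */
};

#define QUEUE_TAIL ((struct task_stub *)1)      /* sentinel: last entry */

static bool try_claim_for_wakeup(struct task_stub *p,
                                 struct task_stub *old_head)
{
        struct task_stub *expected = NULL;
        struct task_stub *link = old_head ? old_head : QUEUE_TAIL;

        /* Succeeds for exactly one waker: the pointer moves
         * NULL -> link; every concurrent waker sees non-NULL
         * and backs off. */
        return atomic_compare_exchange_strong(&p->wake_next,
                                              &expected, link);
}
```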

Your fiddling with the p->state above would let me avoid that.

-chris
From: Stijn Devriendt on
On Thu, May 20, 2010 at 10:48 PM, Chris Mason <chris.mason@oracle.com> wrote:
> I think we probably want to add a way to wait just for a little while as
> a more lightweight operation (less balancing etc) but this patch doesn't
> do that. It only tries to make the act of waking someone up less
> expensive by avoiding the runqueue lock when we're on a different CPU
> than the process we want to wake.
>
> I do this with a per-runqueue list and some cmpxchg tricks. Basically
> all the other CPUs will toss a given process they want to wakeup onto
> the destination per-runqueue list. Later on, when that runqueue is
> finding a new task to run, it processes the list in bulk.

I actually have the reverse lying around somewhere (even more broken, probably)
to allow nested wakeups from the scheduler.

The issue I want to tackle is waking up processes when others go to sleep.
This means try_to_wake_up() from inside the runqueue lock.

I used a simple per_cpu taskqueue that tasks are put on when woken
during schedule(). At the end of schedule() I empty the list and
reschedule, as one of the newly woken tasks may be higher prio.

I'm wondering if both approaches can be merged by checking this list before
and after every schedule().
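
That scheme can be sketched roughly as follows (names invented for illustration; real code would hold the proper locks and do the actual enqueue when draining): wakeups that arrive while the runqueue lock is held are parked on a per-CPU list and flushed once schedule() has picked the next task.

```c
#include <assert.h>
#include <stddef.h>

struct pending_task {
        struct pending_task *next;
        int prio;       /* smaller value = higher priority */
};

struct pending_queue {
        struct pending_task *head;
};

/* Called instead of try_to_wake_up() when inside schedule(). */
static void defer_wakeup(struct pending_queue *q, struct pending_task *p)
{
        p->next = q->head;
        q->head = p;
}

/* Called at the end of schedule(): drain the list and report whether
 * any deferred task outranks the one just picked, in which case the
 * caller should reschedule. */
static int flush_pending(struct pending_queue *q, int picked_prio)
{
        int need_resched = 0;
        struct pending_task *p = q->head;

        q->head = NULL;
        for (; p; p = p->next) {
                /* real code would do the actual enqueue here */
                if (p->prio < picked_prio)
                        need_resched = 1;
        }
        return need_resched;
}
```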

Stijn