From: Tejun Heo on
Hello, Arjan.

On 06/29/2010 08:07 PM, Arjan van de Ven wrote:
> we might be talking past each other. ;-)
>
> let me define an example that is simple so that we can get on the same page
>
> assume a system with "enough" cpus, say 32.
> let's say we have 2 async tasks that each do an mdelay(1000); (yes, I
> know it's stupid, but exaggerating things makes them easier to talk about)

That's the main point to discuss tho. If you exaggerate the use case
out of proportion, you'll end up with something which in the end is
useful only in the imagination, and we'll be doing things just because
we can. Creating the full number of unbound threads might look like a
good idea for extracting maximum cpu parallelism if you exaggerate the
use case like the above, but with the current actual use cases it's not
gonna buy us anything and might even cost us more via unnecessary
thread creation.

So, let's talk about whether it's _actually_ useful for the current
use cases. If so, sure, let's do it that way. If not, there is no
reason to go there, right?

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arjan van de Ven on
On 6/29/2010 11:15 AM, Tejun Heo wrote:
> Hello, Arjan.
>
> On 06/29/2010 08:07 PM, Arjan van de Ven wrote:
>
>> we might be talking past each other. ;-)
>>
>> let me define an example that is simple so that we can get on the same page
>>
>> assume a system with "enough" cpus, say 32.
>> let's say we have 2 async tasks that each do an mdelay(1000); (yes, I
>> know it's stupid, but exaggerating things makes them easier to talk about)
>>
> That's the main point to discuss tho. If you exaggerate the use case
> out of proportion, you'll end up with something which in the end is
> useful only in the imagination, and we'll be doing things just because
> we can. Creating the full number of unbound threads might look like a
> good idea for extracting maximum cpu parallelism if you exaggerate the
> use case like the above, but with the current actual use cases it's not
> gonna buy us anything and might even cost us more via unnecessary
> thread creation.
>

I'm not trying to suggest "unbound".
I'm trying to suggest "don't start bounding until you hit # threads >= #
cpus".
You have some clever tricks to deal with bounding things; but let's make
sure that the simple case of having less work to run in parallel than
the number of cpus gets dealt with simply and unbound.
You also consolidate the thread pools so that you have one global pool,
so unlike the current situation where you get O(Nr pools * Nr cpus)
threads, you only get O(Nr cpus) threads... that's not too burdensome
imo.
If you want to go below that then I think you're going too far in
reducing the number of threads in your pool. Really.


so... back to my question: will those two tasks run in parallel or
sequentially?

From: Tejun Heo on
Hello,

On 06/29/2010 08:22 PM, Arjan van de Ven wrote:
> I'm not trying to suggest "unbound". I'm trying to suggest "don't
> start bounding until you hit # threads >= # cpus". You have some
> clever tricks to deal with bounding things; but let's make sure that
> the simple case of having less work to run in parallel than the
> number of cpus gets dealt with simply and unbound.

Well, the thing is, for most cases, binding to cpus is simply better.
That's the reason why our default workqueue was per-cpu to begin with.
There just are a lot more opportunities for optimization for both
memory access and synchronization overheads.

> You also consolidate the thread pools so that you have one global
> pool, so unlike the current situation where you get O(Nr pools * Nr
> cpus), you only get O(Nr cpus) number of threads... that's not too
> burdensome imo. If you want to go below that then I think you're
> going too far in reducing the number of threads in your
> pool. Really.

I lost you in the above paragraph, but I think it would be better to
keep kthread pools separate. It behaves much better regarding memory
access locality (the work issuer and the worker are on the same cpu, and
the stack and other memory used by the worker are likely to be already
hot). Also, we don't do it yet, but when creating kthreads we can
allocate the stack with NUMA in mind too.

> so... back to my question; will those two tasks run in parallel or
> sequential ?

If they are scheduled on the same cpu, they won't. If that's
something actually necessary, let's implement it. I have no problem
with that. cmwq already can serve as simple execution context
provider without concurrency control and pumping contexts to async
isn't hard at all. I just wanna know whether it's something which is
actually useful. So, where would that be useful?

Thanks.

--
tejun
From: Arjan van de Ven on
On 6/29/2010 11:34 AM, Tejun Heo wrote:
> Hello,
>
> On 06/29/2010 08:22 PM, Arjan van de Ven wrote:
>
>> I'm not trying to suggest "unbound". I'm trying to suggest "don't
>> start bounding until you hit # threads >= # cpus". You have some
>> clever tricks to deal with bounding things; but let's make sure that
>> the simple case of having less work to run in parallel than the
>> number of cpus gets dealt with simply and unbound.
>>
> Well, the thing is, for most cases, binding to cpus is simply better.
>

depends on the user.

For "throw over the wall" work, this is unclear, especially in light
of hyperthreading (where siblings share an L1 cache) or even modern
cpus (where many cores share a fast L3 cache).

I'm fine with a solution that has the caller say 'run anywhere' vs 'try
to run local'.
I suspect there will be many many cases of 'run anywhere'.

> isn't hard at all. I just wanna know whether it's something which is
> actually useful. So, where would that be useful?
>

I think it's useful for all users of your worker pool, not (just) async.

It's a severe limitation of the current linux infrastructure, and your
infrastructure has the chance to fix this...

From: Tejun Heo on
Hello,

On 06/29/2010 08:41 PM, Arjan van de Ven wrote:
>> Well, the thing is, for most cases, binding to cpus is simply better.
>
> depends on the user.

Heh, yeah, sure, can't disagree with that. :-)

> For "throw over the wall" work, this is unclear. Especially in the
> light of hyperthreading (sharing L1 cache) or even modern cpus
> (where many cores share a fast L3 cache).

There will be many variants of memory configurations and the only way
the generic kernel can optimize memory access is if it localizes stuff
per cpu which is visible to the operating system. That's the lowest
common denominator. From there, we sure can add considerations for
specific shared configurations but I don't think that will be too
common outside of scheduler and maybe memory allocator. It just
becomes too specific to apply to generic kernel core code.

> I'm fine with a solution that has the caller say 'run anywhere' vs
> 'try to run local'. I suspect there will be many many cases of 'run
> anywhere'.

Yeah, sure. I can almost view the code in my head right now. If I'm
not completely mistaken, it should be really easy. When a cpu goes
down, the works left behind are already executed unbound, so all the
necessary components are already there.

The thing is that once it's not bound to a cpu, where, how and when a
worker runs is best regulated by the scheduler. That's why I kept
talking about wq being simple context provider.

If something is not CPU intensive, CPU parallelism doesn't buy much,
so works which would benefit from parallel execution are likely to be
CPU intensive ones. For CPU intensive tasks, fairness, priority and
all that stuff are pretty important and that's scheduler's job. cmwq
can provide contexts and put some safety limitations but most are best
left to the scheduler.

>> actually useful. So, where would that be useful?
>
> I think it's useful for all users of your worker pool, not (just) async.
>
> it's a severe limitation of the current linux infrastructure, and your
> infrastructure has the chance to fix this...

Yeah, there could be situations where having a generic context
provider can be useful. I'm just not sure async falls in that
category. For the current users, I think we would be (marginally)
better off with bound workers. So, that's the reluctance I have about
updating async conversion.

Thanks.

--
tejun