From: Anton Ertl on
Mayan Moudgill <mayan(a)bestweb.net> writes:
>Anton Ertl wrote:
>
>> To avoid these overheads and to achieve load balancing, I have been
>> thinking about an implementation that automatically switches between
>> communicating between threads and using coroutining for efficient
>> communication within a thread. Currently these are just ideas, but I
>> have written them down in
>> <http://www.complang.tuwien.ac.at/kps09/pdfs/ertl.pdf>
>>
>
>Did a quick read of the paper. My summary [Correct me if I'm wrong]

No correction needed.

>Kernel scheduling
>5. If kernel Ki pipes information to Ki+1 on the same core, it may
>yield (using coroutine semantics) to Ki+1 after writing the information

The way I envision it, the first kernel yields, eliminating the need
for buffering between the tasks (kernels) in this case.

>Critique:
>1. Only works for pipelines; it doesn't work with more complicated
>graphs such as DAGs.

Does this refer to the load balancing? It would certainly get more
complicated with DAGs. However, I don't think that DAGs would be used
much even if they are supported.

>2. POSIX pipes are byte-streams of potentially unbounded length (i.e.
>you can write an arbitrary number of bytes before the first one is read).

Normally a buffer (typical size: 4KB) is associated with a Unix pipe,
and once that buffer is full, the writer blocks.

> a. You probably want a stronger type system on the data

That's up to the language designer. I would expect that if you add
such a feature to a statically type-checked language, you would also
statically type-check the communication over a channel. That's what
programmers in such a language expect, want, and would miss if a
feature was added without such typechecking.

Personally, I feel that the lack of static type-checking and the
uniformity of data in files and pipes in Unix is a strength that
allows the reuse of filters for many tasks.

> b. Do you want to bound the buffer size?

No bound is guaranteed, but there has to be a bound because of (at
least) hardware constraints (e.g., address space) and practical
considerations (such as wanting to process the data while it is warm
in some cache rather than letting it chill out into main memory and
later having to drag it back in).

>3. You're going to have to worry about copying the data at least once
>and probably twice (data gets created, copied into buffer, data gets
>copied from buffer, data gets used).

My idea is that (on a shared-memory machine) the writing task stores
the data directly into the buffer, and later the reading task fetches
the data from exactly the same place. The tasks may copy the data
from/to another buffer, or the data may come directly from a
computation, or may be used directly in a computation; that's up to
the tasks. There will of course be an implicit copy of the cache line
from the writing task's core to the reading task's core, but no
explicit copy is necessary.

>It is possible to come up with
>implementations of 2b that, if the data is not copied, can lead to lockup.

Do you have a particular scenario in mind?

>4. The overhead of coroutining is not the state save, its the loss of
>locality of the caching and prediction structures.

Yes, that's a potential problem. We will have to see if it's a real
problem (and if so, maybe the hardware designers can come up with
prediction structures that also work well under these circumstances).

Why I think that it might be no problem: Today nearly all programs are
sequential. Programs for problems that could be divided into
pipelines often have quite convoluted control flow. Now if we
divide such a program (part) into pipeline stages, each stage will be
simpler and smaller, and maybe more predictable. If we then combine
some of these stages back together again on one core using
coroutining, they will still not be as bad as a sequential program;
and the hardware was designed to perform well on the sequential
program.

There is one exception: the return stack (predicting function return
addresses) of current machines will not work well when coroutining is
involved. Let's see it as an opportunity for computer architects to come
up with a better return predictor.

>5. What if the kernels have different data sizes (e.g. kernel A produces
>data in chunks of 512 and the next one, B, processes it in a chunk of
>2K; your workload will be AbAbAbAB where b is kernel B reading the 512B,
>and then blocking waiting for the next read).

That's for the case when the tasks/kernels are on different cores,
right? My idea is that the load balancing will try to avoid cases
where a buffer is completely empty or completely full. But yes, if it
cannot avoid that, one of the threads involved will block.

>My main objection is (1) - I don't expect that pipelines
>are going to be a very interesting case.

I think there are several reasons why pipes are interesting:

* They help modularity (most other parallel and concurrent programming
practices are at odds with modularity).

* They encourage reusable components (filters).

* They are so useful that programmers have used pipelines even when
they did not intend to exploit parallelism (in Unix shell scripts).

* Not much synchronization is necessary; when the buffer is neither
empty nor full, you get very cheap communication. And you can make
the buffer larger to reduce the expensive cases.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html