From: "Andy "Krazy" Glew" on
Bernd Paysan wrote:
> Andy "Krazy" Glew wrote:
>> 1) SMP: shared memory, cache coherent, a relatively strong memory
>> ordering model like SC or TSO or PC. Typically writeback cache.
>>
>> 0) MPI: no shared memory, message passing
>
> You can also have shared "write-only" memory. That's close to the MPI
> side of the tradeoffs. Each CPU can read and write its own memory, but
> can only write remote memories. The pro side is that all you need is a
> similar infrastructure to MPI (send data packets around), and thus it
> scales well; also, there are no blocking latencies.
>
> The programming model can be closer to data flow than pure MPI, since
> when you only pass data, writing the data to the target destination is
> completely sufficient. A "this data is now valid" message might be
> necessary (or some log of the memory controller where each CPU can
> extract which regions were written to).

At first I liked this, and then I realized what I liked was the idea of
being able to create linked data structures, readable by anyone, but
only manipulated by the local node - except for the minimal operations
necessary to link new nodes into the data structure.

I don't think that ordinary read/write semantics are acceptable. I
think that you need the ability to "atomically" (for some definition of
atomic - all atomicity is relative) read a large block of data. Used by
a node A to read a data node in node B's memory.

Node A might then allocate new nodes in its own memory. And publish
them as follows, probably using atomic rmw type operations to link the
new node into the old data structure. Compare-and-swap, possibly
fancier ops like atomic insert into hash table.
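Concretely, something like the toy below is what I have in mind. The
"remote" operations are just stand-ins running on ordinary local memory
with C11 atomics, so the sketch compiles and runs; on the machine I am
imagining they would be operations on another node's memory, carried
out by the memory controller or NIC, and every name here is made up.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    _Atomic(struct node *) next;  /* the only field anyone else may touch */
    int64_t key;
    int64_t payload[6];           /* body written only by the owning node */
} node;

/* Stand-in for the "atomic enough" block read of a node that lives in
   another node's memory - all atomicity is relative; one consistent
   snapshot of the block is all that is needed here. */
static void remote_read_block(node *dst, const node *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Stand-in for a remote compare-and-swap on a link field that lives in
   another node's memory. */
static int remote_cas(_Atomic(node *) *link, node *expected, node *desired)
{
    return atomic_compare_exchange_strong(link, &expected, desired);
}

/* Node A allocates a node in its own memory, fills it in, and publishes
   it by swinging a link that lives in node B's memory. */
static void publish(_Atomic(node *) *remote_head, node *mine)
{
    node *old;
    do {
        old = atomic_load(remote_head);    /* really a remote read     */
        mine->next = old;                  /* local write, A's memory  */
    } while (!remote_cas(remote_head, old, mine));
}

int main(void)
{
    static _Atomic(node *) head;           /* lives in "node B's" memory */
    node *mine = calloc(1, sizeof *mine);  /* lives in "node A's" memory */
    mine->key = 42;
    publish(&head, mine);

    node snap;                             /* any node may read the list */
    remote_read_block(&snap, atomic_load(&head));
    printf("head key = %lld\n", (long long)snap.key);
    return 0;
}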

(By the way, I have great sympathy for sending chunks of code around -
is that "Actors"? - except that I would like some of these operations to
be handled by memory controller hardware (for any of the several
definitions of memory controller), and it is hard to think of arbitrary
code that is sufficiently constrained.)
From: Bernd Paysan on
Andy "Krazy" Glew wrote:
>> You can also have shared "write-only" memory. That's close to the MPI
>> side of the tradeoffs. Each CPU can read and write its own memory, but
>> can only write remote memories. The pro side is that all you need is a
>> similar infrastructure to MPI (send data packets around), and thus it
>> scales well; also, there are no blocking latencies.
>>
>> The programming model can be closer to data flow than pure MPI, since
>> when you only pass data, writing the data to the target destination is
>> completely sufficient. A "this data is now valid" message might be
>> necessary (or some log of the memory controller where each CPU can
>> extract which regions were written to).
>
> At first I liked this, and then I realized what I liked was the idea
> of being able to create linked data structures, readable by anyone,
> but only manipulated by the local node - except for the minimal
> operations necessary to link new nodes into the data structure.

That's the other way round, i.e. single writer, multiple readers (pull
data in). What I propose is single reader, multiple writer (push data
out).
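As a toy - MPI-2's one-sided put is not exactly what I mean, since the
collective fences drag in synchronisation I would rather avoid, but it
has the right shape: every rank writes only remote memory and reads
only its own, and the fence stands in for the "this data is now valid"
message.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define SLOT 256                 /* doubles reserved per writing peer */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Win win;
    double *mem;                 /* my memory, writable by the peers  */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mem = calloc((size_t)size * SLOT, sizeof *mem);
    MPI_Win_create(mem, (MPI_Aint)size * SLOT * sizeof *mem, sizeof *mem,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);

    /* Push data into my right-hand neighbour's memory; never read it. */
    double out[SLOT];
    for (int i = 0; i < SLOT; i++) out[i] = rank + i / 1000.0;
    int right = (rank + 1) % size;
    MPI_Put(out, SLOT, MPI_DOUBLE, right,
            (MPI_Aint)rank * SLOT, SLOT, MPI_DOUBLE, win);

    /* The fence plays the role of the "data is now valid" message.   */
    MPI_Win_fence(0, win);

    /* Read my own memory locally: the slot my left neighbour filled. */
    int left = (rank + size - 1) % size;
    printf("rank %d: mem[%d] = %g (pushed by rank %d)\n",
           rank, left * SLOT, mem[(size_t)left * SLOT], left);

    MPI_Win_free(&win);
    free(mem);
    MPI_Finalize();
    return 0;
}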

> I don't think that ordinary read/write semantics are acceptable. I
> think that you need the ability to "atomically" (for some definition of
> atomic - all atomicity is relative) read a large block of data. Used
> by a node A to read a data node in node B's memory.

Works by asking B to send data over to A.
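Roughly like the fragment below - plain two-sided MPI just to keep the
toy short (it plugs into the usual MPI_Init boilerplate); in the
write-only model B's reply would instead be pushed straight into A's
memory together with a "valid" flag, and the request struct is a
made-up wire format.

#include <mpi.h>

#define TAG_REQ  1
#define TAG_DATA 2

typedef struct { long off; int len; } request;

/* Node A: "atomic enough" read of a block that lives in B's memory.  */
static void fetch_block(int B, long off, int len, void *buf)
{
    request r = { off, len };
    MPI_Send(&r, (int)sizeof r, MPI_BYTE, B, TAG_REQ, MPI_COMM_WORLD);
    MPI_Recv(buf, len, MPI_BYTE, B, TAG_DATA, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

/* Node B: serve one request out of its own memory (base = B's heap).
   Since only B touches its memory between requests, the block it
   sends back is a consistent snapshot. */
static void serve_one(const char *base)
{
    request r;
    MPI_Status st;
    MPI_Recv(&r, (int)sizeof r, MPI_BYTE, MPI_ANY_SOURCE, TAG_REQ,
             MPI_COMM_WORLD, &st);
    MPI_Send((void *)(base + r.off), r.len, MPI_BYTE, st.MPI_SOURCE,
             TAG_DATA, MPI_COMM_WORLD);
}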

> Node A might then allocate new nodes in its own memory. And publish
> them as follows, probably using atomic rmw type operations to link the
> new node into the old data structure. Compare-and-swap, possibly
> fancier ops like atomic insert into hash table.
>
> (By the way, I have great sympathy for sending chunks of code around -
> is that "Actors"? - except that I would like some of these operations
> to be handled by memory controller hardware (for any of the several
> definitions of memory controller), and it is hard to think of
> arbitrary code that is sufficiently constrained.)

Sending chunks of code around which are automatically executed by the
receiver is called "active messages". Not only do I like the idea, a
friend of mine has used it successfully for decades (the messages in
question were Forth source - quite a high-level form of active
messages). Doing that in the memory controller looks like a good idea
to me, too, at least for the kind of code a memory controller can
handle. The good thing about this is that you can collect all your
"orders" and send them in one go - this removes a lot of latency,
especially if your commands can include something like compare&swap or
even a complete "insert into list/hash table" (which, unlike
compare&swap, won't fail).
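A toy of the batching idea, with everything - opcodes, encoding, names
- invented for illustration. The "remote" memory is plain local memory
here so that the sketch actually runs; read execute() as what the
memory controller at the far end does with one incoming batch: one
message, many orders, the latency paid once, and the list push retried
on the spot so the sender never sees it fail.

#include <stdio.h>
#include <stdint.h>

enum op { OP_WRITE, OP_CAS, OP_LIST_PUSH };

typedef struct {
    int       op;
    uintptr_t addr;   /* location in the receiver's memory             */
    uintptr_t arg0;   /* value / expected / address of node to insert  */
    uintptr_t arg1;   /* new value (OP_CAS only)                       */
} order;

typedef struct { order o[64]; int n; } batch;

static void add_order(batch *b, int op, uintptr_t a, uintptr_t x, uintptr_t y)
{
    b->o[b->n++] = (order){ op, a, x, y };
}

/* Receiver side: work through one batch (GCC/Clang atomic builtins). */
static void execute(batch *b)
{
    for (int i = 0; i < b->n; i++) {
        order *o = &b->o[i];
        uintptr_t *p = (uintptr_t *)o->addr;
        switch (o->op) {
        case OP_WRITE:
            *p = o->arg0;
            break;
        case OP_CAS: {                     /* may fail, like any CAS   */
            uintptr_t exp = o->arg0;
            __atomic_compare_exchange_n(p, &exp, o->arg1, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
            break;
        }
        case OP_LIST_PUSH: {               /* retried here, so it never
                                              fails as seen by the sender */
            uintptr_t *node = (uintptr_t *)o->arg0;  /* node[0] = next */
            uintptr_t old;
            do {
                old = __atomic_load_n(p, __ATOMIC_SEQ_CST);
                node[0] = old;
            } while (!__atomic_compare_exchange_n(p, &old, o->arg0, 0,
                                                  __ATOMIC_SEQ_CST,
                                                  __ATOMIC_SEQ_CST));
            break;
        }
        }
    }
}

int main(void)
{
    uintptr_t head = 0;             /* pretend these live on the far node */
    uintptr_t node[2] = { 0, 0 };   /* { next, payload }                  */
    batch b = { .n = 0 };

    add_order(&b, OP_WRITE, (uintptr_t)&node[1], 42, 0);
    add_order(&b, OP_LIST_PUSH, (uintptr_t)&head, (uintptr_t)node, 0);
    execute(&b);                    /* "sending" the batch = one call here */

    printf("head -> node with payload %lu\n",
           (unsigned long)((uintptr_t *)head)[1]);
    return 0;
}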

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Robert Myers on
On Dec 22, 5:54 pm, Bernd Paysan <bernd.pay...(a)gmx.de> wrote:
> Andy "Krazy" Glew wrote:
> >> You can also have shared "write-only" memory.  That's close to the
> >> MPI
> >> side of the tradeoffs.  Each CPU can read and write its own memory,
> >> but
> >> can only write remote memories.  The pro side is that all you need is
> >> a similar infrastructure to MPI (send data packets around), and thus
> >> it scales well; also, there are no blocking latencies.
>
> >> The programming model can be closer do data flow than pure MPI, since
> >> when you only pass data, writing the data to the target destination
> >> is
> >> completely sufficient.  An "this data is now valid" message might be
> >> necessary (or some log of the memory controller where each CPU can
> >> extract what regions were written to).
>
> > At first I liked this, and then I realized what I liked was the idea
> > of being able to create linked data structures, readable by anyone,
> > but only manipulated by the local node - except for the minimal
> > operations necessary to link new nodes into the data structure.
>
> That's the other way round, i.e. single writer, multiple readers (pull
> data in).  What I propose is single reader, multiple writer (push data
> out).
>
> > I don't think that ordinary read/write semantics are acceptable.  I
> > think that you need the ability to "atomically" (for some definition
> > of
> > atomic - all atomicity is relative) read a large block of data.  Used
> > by a node A to read a data node in node B's memory.
>
> Works by asking B to send data over to A.
>
> > Node A might then allocate new nodes in its own memory.  And publish
> > them as follows, probably using atomic rmw type operations to link the
> > new node into the old data structure.  Compare-and-swap, possibly
> > fancier ops like atomic insert into hash table.
>
> > (By the way, I have great sympathy for sending chunks of code around -
> > is that "Actors"? - except that I would like some of these operations
> > to be handled by memory controller hardware (for any of the several
> > definitions of memory controller), and it is hard to think of
> > arbitrary code that is sufficiently constrained.)
>
> Sending chunks of code around which are automatically executed by the
> receiver is called "active messages". Not only do I like the idea, a
> friend of mine has used it successfully for decades (the messages in
> question were Forth source - quite a high-level form of active
> messages). Doing that in the memory controller looks like a good idea
> to me, too, at least for the kind of code a memory controller can
> handle. The good thing about this is that you can collect all your
> "orders" and send them in one go - this removes a lot of latency,
> especially if your commands can include something like compare&swap or
> even a complete "insert into list/hash table" (which, unlike
> compare&swap, won't fail).
>
I don't know all the buzz words, so forgive me.

If you know the future (or the dataflow graph ahead of time), you can
assemble packets of whatever - any piece of the problem: code, data,
meta-data, meta-code, whatever - and send each one off to some location
where the other pieces needed for that piece of the problem will also
arrive, pushed from who-cares-where. When enough pieces are in hand to
act on, the receiving location acts on whatever it can. When any piece
of anything that can be used elsewhere is finished, it is sent on to
wherever it is needed next. The only requirement is that there is some
agent, a bit like a DNS, that can tell pieces with particular
characteristics which arbitrarily chosen processors (or collections of
processors) they should migrate to for further use. Receiving agents
are not required to do anything but wait until they have enough
information to act on, and the packets themselves tell the receiving
agent what else is needed for further action (but not where it can be
found).

Many problems seem to disappear as if by magic: the need for
instruction and data prefetch (two separate prediction processes),
latency issues, the need for cache, and the need to invent elaborate
constraints on what kinds of packets can be passed around, since the
structure (and, in effect, the programming language) can be completely
ad hoc. Concurrency doesn't even seem to be an issue. It's a bit like
an asynchronous processor, and it seems implementable in any
circumstance where a data-push model can be implemented.
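In toy form, the receiving agent's rule is about this simple (all the
names and the packet layout are invented just for illustration):

#include <stdio.h>
#include <string.h>

#define MAX_TASKS  16
#define MAX_PIECES 8

typedef struct {
    int  task;             /* which piece of the problem this belongs to */
    int  piece;            /* index of this piece within the task        */
    int  pieces_needed;    /* how many pieces the task needs in total    */
    char data[64];         /* code, data, meta-data - whatever           */
} packet;

typedef struct {
    int    have, need;
    packet parked[MAX_PIECES];
} slot;

static slot table[MAX_TASKS];

/* Called for every packet that arrives, in any order, from anywhere. */
static void receive(const packet *p)
{
    slot *s = &table[p->task];
    s->need = p->pieces_needed;
    s->parked[p->piece] = *p;
    if (++s->have == s->need) {
        printf("task %d: all %d pieces in hand, acting:\n", p->task, s->need);
        for (int i = 0; i < s->need; i++)
            printf("  piece %d: %s\n", i, s->parked[i].data);
        /* finished results would now be packed up and pushed on to
           wherever they are needed next - which is just more packets */
        memset(s, 0, sizeof *s);
    }
}

int main(void)
{
    /* pieces of task 3 arrive out of order, pushed from who-cares-where */
    packet a = { 3, 1, 3, "operand B" };
    packet b = { 3, 2, 3, "code: add" };
    packet c = { 3, 0, 3, "operand A" };
    receive(&a); receive(&b); receive(&c);   /* fires on the third one */
    return 0;
}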

I know (or hope) that I'll be told that it's all been thought of and
tried and the reasons why it is impractical. That's the point of the
post.

Robert.

From: Bernd Paysan on
Robert Myers wrote:
> I don't know all the buzz words, so forgive me.

Buzz words are only useful for "buzzword bingo" and when feeding search
engines ;-).

> If you know the future (or the dataflow graph ahead of time), you can
> assemble packets of whatever [...] It's a bit like an asynchronous
> processor, and it seems implementable in any circumstance where a
> data-push model can be implemented.

Indeed.

> I know (or hope) that I'll be told that it's all been thought of and
> tried and the reasons why it is impractical. That's the point of the
> post.

It has been tried and it works - you can find a number of papers about
active message passing from various universities. However, it seems
that most people try to implement some standard protocols like MPI on
top of it, so the benefits might be smaller than expected. And as Andy
already observed: Most people seem to be more comfortable with
sequential programming. Using such an active message system makes the
parallel programming quite explicit - you model a data flow graph, you
create packets with code and data, and so on.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Terje Mathisen on
Bernd Paysan wrote:
> Sending chunks of code around which are automatically executed by the
> receiver is called "active messages". Not only do I like the idea, a
> friend of mine has used it successfully for decades (the messages in
> question were Forth source - quite a high-level form of active
> messages). Doing that in the memory controller looks like a good idea
> to me, too, at least for the kind of code a memory controller can
> handle. The good thing about this is that you can collect all your
> "orders" and send them in one go - this removes a lot of latency,
> especially if your commands can include something like compare&swap or
> even a complete "insert into list/hash table" (which, unlike
> compare&swap, won't fail).
>
Why do I feel that this sounds a lot like IBM mainframe channel
programs? :-)

(Security is of course implicit here: If you _can_ send the message,
you're obviously safe, right?)

Terje
PS. This is my very first post from my personal leafnode installation: I
have free news access via my home (fiber) ISP, but not here in Rauland
on Christmas/New Year vacation, so today I finally broke down and
installed leafnode on my home FreeBSD gps-based ntp server. :-)
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"