From: "Andy "Krazy" Glew" on
Bernd Paysan wrote:
> Andy "Krazy" Glew wrote:
>> You can have fixed prioritization rules: E.g. local traffic (b) wins,
>> requiring that the downward traffic coming from the above links be
>> either buffered or resent. Or the opposite, one of the downward
>> traffic wins.
>
> It's fairly easy to decide if you should buffer or not - it just depends
> on latency again. I think you can safely assume the data is buffered at
> the origin, i.e. the data there already is in memory, and not
> constructed on the fly - so this resent "buffer" is for free. If the
> time for a "resent" command arrives at the source quick enough so that
> the resent data will arrive at the congested node just in time when the
> previous, prioritized packet is sent out, then we are fine.

I've worked on several projects where this assumption, that the data is in a memory buffer and can be resent, does not
fly, or at least has issues.

a) Zero copy and/or direct-from-user-space messaging to the network.

The issue arises if a different thread with access to the buffer you are sending from can modify the memory buffer while
the message is in flight.

Now, most MPI programs don't care. But if you want to use this sort of thing to do messaging between protection
domains (which IMHO is the right thing to do: that sort of thing is much more important commercially than MPI within the
same protection domain - think web services, SOA), then you have to have proper semantics for this.

All this was easy before shared memory multiprocessing. The fact that you had entered the kernel meant that you
effectively had the data exclusively. A whole slew of optimizations got shut off when shared memory threading started.

"Proper semantics" means just that - something that the user can depend on, that will not give rise to flakey behavior.
Or, more likely, whatever semantics happened in your first implementation that people depend on.

Given any form of shared memory MP/MT, there can't be any meaningful "atomic" message - the message as of the time of
the system call. Even if the system call copies the memory, the other thread may be modifying it at the same time.
(Yes, we need message ordering models as well as memory ordering models.) Even if the system call remaps memory using
the page table or an IOMMU, some changes may still happen. That in itself is not so bad - what's bad is if the recipient
can observe part of the message arriving (in his local buffer), act on it, and thereby cause a thread to modify the
partially sent message. And, yes, the two ends of the message may share memory.
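
To make the hazard concrete, here is a minimal sketch (illustrative only - fake_zero_copy_send() just stands in for
whatever DMA engine or NIC walks the buffer in place): one thread "sends" while another scribbles on the buffer, and
the receiver can end up with a torn message.

    /* Sketch of the zero-copy hazard: a "send" that reads the buffer in
       place while another thread modifies it.  fake_zero_copy_send() is a
       stand-in for the NIC/DMA engine, not any real messaging API. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_LEN 4096
    static char msg[MSG_LEN];

    /* Walk the buffer in place, the way a DMA engine would. */
    static void fake_zero_copy_send(char *rcv_buf)
    {
        for (int i = 0; i < MSG_LEN; i++)
            rcv_buf[i] = msg[i];        /* reads the live, shared buffer */
    }

    static void *mutator(void *arg)
    {
        (void)arg;
        memset(msg, 'B', MSG_LEN);      /* overwrites the message mid-flight */
        return NULL;
    }

    int main(void)
    {
        char rcv_buf[MSG_LEN];
        pthread_t t;

        memset(msg, 'A', MSG_LEN);      /* the message "as of the system call" */
        pthread_create(&t, NULL, mutator, NULL);
        fake_zero_copy_send(rcv_buf);   /* races with the mutator */
        pthread_join(t, NULL);

        /* The receiver may see a mix of 'A's and 'B's: no atomic message. */
        printf("first=%c last=%c\n", rcv_buf[0], rcv_buf[MSG_LEN - 1]);
        return 0;
    }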

However, I'll accept your saying that this problem may not need to be solved. Just define it away.

b) Messaging straight from registers. (Hmm, let's see now, I have defined such instruction set extensions at 4
companies now. Never shipped, AFAIK.)

The overall issue is whether you support "send and forget". Does the sender block, so that the data to be sent can be
recovered (from the register source)? Or does he go on, possibly modifying the register source?

In the shared memory case, there may be no practical way to preserve the original data to be sent.

Copying a small register value is reasonable. Copying a memory value has less precise semantics, as mentioned above.

By the way, if you resend modified data, sometimes security issues arise. They shouldn't, but... sometimes people check
the data as it is being sent, but don't bother checking on a resend. That is not good; every resend needs to be
rechecked, if there is a possibility that the data has changed.

Anyway, this is really just a side point.



> I'm not sure whether I would use a fat-tree topology if I build an on-
> chip network out of 2x2 router primitives. The fat-tree has excellent
> bisection bandwidth, and it offers improved local bandwidth, but it has
> a non-physical topology (i.e. you get longer and longer wires to the
> next level up - and this also means the latency of the nodes is not
> equal), and it doesn't scale well, because for a larger network, you
> need more *and* longer root connections.

I agree; I only started thinking about fat trees because RM wanted scalable bisection bandwidth.

Plus, I have been thinking about my conversation with Ivan Sutherland, and the possibility that all computations are now
bandwidth sensitive, latency tolerant. In which case the longer wires' latency doesn't matter. What matters is the
cost - volume, dollars, power - of all the extra hardware. I think I posted a thought experiment a while back that said
"go for the fancy network as long as the cost of the wires and switches is less than a processor", wrt incrementally
adding network complexity. When the incremental cost exceeds a processor, step back to a simpler network, and add
processors.
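
That rule of thumb is just a per-increment comparison; a toy version (all the cost numbers below are made up, folding
volume, dollars and power into one unit):

    /* Toy version of the "fancier network vs. more processors" rule.
       All costs are made-up placeholder units. */
    #include <stdio.h>

    int main(void)
    {
        double proc_cost = 100.0;   /* cost of one more processor */
        /* Incremental cost of each successive network upgrade
           (wires + switches), growing as links get longer. */
        double net_step[] = { 20.0, 40.0, 80.0, 160.0, 320.0 };
        int n = sizeof net_step / sizeof net_step[0];

        for (int i = 0; i < n; i++) {
            if (net_step[i] < proc_cost) {
                printf("upgrade %d: add network (%.0f < %.0f)\n",
                       i, net_step[i], proc_cost);
            } else {
                printf("upgrade %d: too expensive (%.0f >= %.0f), "
                       "stay with the simpler network and add processors\n",
                       i, net_step[i], proc_cost);
                break;
            }
        }
        return 0;
    }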


> I'd first try with a 2D mesh,

Yep. Meshes are the prototypical limited valency / spatially embedded network.

In the end, it's all meshes.
From: nmm1 on
In article <4BA83CCF.9010307(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>I've worked on several projects where this assumption, that the data is
>in a memory buffer and can be resent, does not fly, or at least has
>issues.
>
>a) Zero copy and/or direct from user space messaging the network.
>
>If a different thread with access to the buffer you are sending from can
>modify the memory buffer while the message is in flight.
>
>Now, most MPI programs don't care. But, if you want to use this sort of
>thing to do messaging between protection domains (which IMHO is the
>right thing to do: that sort of thing is much more important commercially
>than MPI within the same protection domain. Think web services, SOA.)
>then you have to have a proper semantics for this.
>
>All this was easy before shared memory multiprocessing. The fact that
>you had entered the kernel meant that you effectively had the data
>exclusively. A whole slew of optimizations got shut off when shared
>memory threading started.

Yes. I have had a lot of arguments with the OpenMP and PGAS people,
pointing out that their claims that those are easier to use than message
passing are simply not true. Yes, writing the code is easier, but
getting it right isn't. And the problem is race conditions.

>Given any form of shared memory MP/MT, there can't be any meaningful
>"atomic" message - the message as of the time of the system call. Even
>if the system call copies the memory, the other thread may be modifying
>it at the same time. (Yes, we need message ordering models as well as
>memory ordering models.) Even if the system call remaps memory using
>the page table or an IOMMU, some changes may happen. ...

Not any form. You and I can think of designs where that is not so.
But they don't use existing shared memory architectures!

My favoured design is a combination of the old capability systems and
the copy-on-write file systems. You can then transfer a message
atomically and safely. I doubt that you could implement it on current
hardware with acceptable efficiency.
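
You can see the property I am relying on with nothing fancier than fork() - this is not the capability design itself,
just a demonstration that copy-on-write gives the receiver an atomic snapshot of the message without an up-front copy:

    /* Copy-on-write snapshot demo using fork(): the child sees the message
       exactly as it was at fork time, even though the parent keeps writing.
       This only illustrates the COW property, not a capability system. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MSG_LEN 4096
    static char msg[MSG_LEN];

    int main(void)
    {
        memset(msg, 'A', MSG_LEN);          /* the message to be "sent" */

        pid_t pid = fork();                 /* child gets a COW view of msg */
        if (pid == 0) {
            sleep(1);                       /* give the parent time to scribble */
            printf("child sees: first=%c last=%c (still the snapshot)\n",
                   msg[0], msg[MSG_LEN - 1]);
            return 0;
        }

        memset(msg, 'B', MSG_LEN);          /* parent modifies "after send" */
        waitpid(pid, NULL, 0);
        printf("parent now has: first=%c\n", msg[0]);
        return 0;
    }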


Regards,
Nick Maclaren.
From: Bernd Paysan on
Andy "Krazy" Glew wrote:
> I've worked on several projects where this assumption, that the data
> is in a memory buffer and can be resent, does not fly, or at least has
> issues.
>
> a) Zero copy and/or direct from user space messaging the network.

I'd expect that, as it is most efficient to do so.

> If a different thread with access to the buffer you are sending from
> can modify the memory buffer while the message is in flight.

You should allow programs to shoot themselves in the foot, no question ;-).
The point here is that an MPI primitive implemented in such a way is not
atomic, so you *can* send messages, but you aren't allowed to modify
that memory until you receive an acknowledgement or similar.

> Now, most MPI programs don't care. But, if you want to use this sort
> of thing to do messaging between protection domains (which IMHO is the
> right thing to do: that sort of thing is much more important
> commercially than MPI within the
> same protection domain. Think web services, SOA.) then you have to
> have a proper semantics for this.

Yes, fully agreed. However, as you describe below, with MP/MT there is
no way to make it fully atomic. The proper semantics therefore are:
"send" will send whatever is in the buffer memory at some point between
the time the "send" command was issued and the "acknowledge" event was
received. If the program modifies the buffer after "send", that's a bug
in the program.
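
That is exactly the contract MPI already spells out for nonblocking sends; a minimal sketch (assuming mpicc and at
least two ranks):

    /* The "don't touch the buffer until acknowledged" contract, as MPI
       states it for nonblocking sends.  Build with mpicc, run with at
       least two ranks. */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[1024];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            memset(buf, 'A', sizeof buf);
            MPI_Isend(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            /* Modifying buf here would be exactly the bug described above. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the "acknowledge" event */
            memset(buf, 'B', sizeof buf);        /* now reuse is legal */
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }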

When you use such a protocol as the basis for Internet communication,
what really should happen on a packet loss is probably best left to the
application - e.g. a web page needs a resend, but a speech codec can
rely on packet loss concealment. The whole application logic should be
event driven anyway, so when the program receives the event "packet xy
dropped", it can decide what to do. It may even be able to regenerate
the content, even if the send buffer has been reused in the meantime.
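
A sketch of that event-driven shape (event and handler names are made up, just to show the dispatch; it's not any
particular network stack):

    /* Event-driven handling of a "packet dropped" notification: the
       application, not the transport, decides whether to resend or
       conceal.  Event names and handlers are invented for illustration. */
    #include <stdio.h>

    enum payload_kind { PAGE_DATA, SPEECH_FRAME };

    struct drop_event {
        int seq;                    /* which packet was lost */
        enum payload_kind kind;     /* what it carried */
    };

    static void resend(int seq)        { printf("resend packet %d\n", seq); }
    static void conceal_frame(int seq) { printf("conceal frame %d\n", seq); }

    static void on_packet_dropped(const struct drop_event *ev)
    {
        switch (ev->kind) {
        case PAGE_DATA:
            resend(ev->seq);        /* a web page really needs the bytes */
            break;
        case SPEECH_FRAME:
            conceal_frame(ev->seq); /* the codec papers over the gap */
            break;
        }
    }

    int main(void)
    {
        struct drop_event e1 = { 17, PAGE_DATA };
        struct drop_event e2 = { 18, SPEECH_FRAME };
        on_packet_dropped(&e1);
        on_packet_dropped(&e2);
        return 0;
    }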

> However, I'll accept you saying this is problem may not need to be
> solved. Just define it away.

Done ;-).

> b) Messaging straight from registers. (Hmm, let's see now, I have
> defined such instruction set extensions at 4
> companies now. Never shipped, AFAIK.)
>
> The overall issue is whether you support "send and forget". Does the
> sender block, so that the data to be sent can be
> recovered (from the register source). Or does he go on, possibly
> modifying the register source.

I have a similar issue with flow control messages in such a network:
they are "send and forget"-type messages as well. You can deal with
that to some extent by making flow control messages higher priority, or
by buffering them in any case (they are small, so you don't need big
buffers), but you can't resend flow control messages.

So the likely solution to that problem is to have different quality of
service for short and long packets - short packets (containing just a
register value or a flow control message) can be buffered and pass the
network without loss, while long packets will need retransmission on
overload. This comes with a bandwidth limitation for short packets
(i.e. to achieve the full bandwidth, you definitely need longer packets
- "short" in the on-chip context means "64 bits of data" or so).
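
A sketch of that split at an output port (the 64-bit threshold comes from above; buffer sizes and names are made up):

    /* Sketch of the short/long split at an output port: short packets
       (<= 64 bits of payload) always go into a small lossless buffer,
       long packets are NACKed on overload and retransmitted by the
       source.  Sizes and names are made up for illustration. */
    #include <stdbool.h>
    #include <stdio.h>

    #define SHORT_BITS   64    /* "short" in the on-chip sense */
    #define LONG_CREDITS 2     /* long packets we can absorb at once */

    static int long_in_flight;

    /* Returns true if accepted, false if the source must retransmit. */
    static bool offer_packet(int payload_bits)
    {
        if (payload_bits <= SHORT_BITS)
            return true;            /* buffered, never dropped */
        if (long_in_flight < LONG_CREDITS) {
            long_in_flight++;       /* freed again when the packet */
            return true;            /* leaves the port */
        }
        return false;               /* overload: NACK, resend later */
    }

    int main(void)
    {
        printf("flow control msg: %s\n", offer_packet(32)  ? "buffered" : "nack");
        printf("long packet 1:    %s\n", offer_packet(512) ? "accepted" : "nack");
        printf("long packet 2:    %s\n", offer_packet(512) ? "accepted" : "nack");
        printf("long packet 3:    %s\n", offer_packet(512) ? "accepted" : "nack");
        return 0;
    }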

Chuck Moore's very simple SeaForth array has a 16-bit neighborhood
communication mesh, and there the "send" and "receive" instructions
(really a store for send and a load for receive) block. However, this
is really a short-distance point-to-point communication.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Bernd Paysan on
Bernd Paysan wrote:
> I'd first try with a 2D mesh, i.e. use something like an 8x8 router
> connecting four local nodes plus four bidirectional links in each
> direction.

Some more thoughts about optimizing router structure for a larger mesh:

We want more bandwidth for the "long distance" traffic, i.e. along the
roads, rather than to and from local blocks. We also want to reach far
out, i.e. source routing shouldn't waste too many bits on long
interconnections. A fat tree is fine here, because as a hierarchical
structure it compresses the route optimally - but the decision was to go
with a flat, physically mapped topology (the mesh).

So we "stack" our roads to get fast vertical and horizontal "highways",
which favor global traffic (i.e. if you want to get on the highway or
make a turn, you have to wait/get buffered), and the first decision on
each crossroad therefore is "turn or continue" (one bit per hop, or with
a virtual route it's just a counter that decrements to zero at the last
hop, and there it means "turn"). We use 4 2x2 routers for that.

If the decision was "turn", the next decision is "leave the highway or
stay" (two 2x2 routers), and then "left or right" if you stay (another
two 2x2 routers), or "which of the four local corners" if you leave
(four 2x2 routers). The total number of router elements is the same as
for a symmetric 8x8 router (12), but we now have a fast and possibly
wider path for the global routing (the wider path is best used to
interleave several packets, which means only the aggregate bandwidth
increases, not the bandwidth per packet - similar to a road with
multiple lanes).
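
Written down as a source-route decoder, the decision sequence per crossroad is only a handful of bits (the field
layout below is made up, just to show how little each hop consumes):

    /* Per-hop decision sequence from above: a hop counter that means
       "continue on the highway" until it hits zero ("turn"), then a bit
       for "leave or stay", a bit for "left or right", and two bits for
       the local corner.  Field layout is made up for illustration. */
    #include <stdio.h>

    struct route {
        unsigned hops_to_turn;  /* decremented per crossroad, 0 = turn */
        unsigned leave  : 1;    /* after the turn: leave the highway? */
        unsigned right  : 1;    /* if staying: left (0) or right (1) */
        unsigned corner : 2;    /* if leaving: which of the four local nodes */
    };

    /* One crossroad; returns 1 while the packet keeps travelling. */
    static int hop(struct route *r)
    {
        if (r->hops_to_turn > 0) {              /* first 2x2 stage */
            r->hops_to_turn--;
            printf("continue on highway\n");
            return 1;
        }
        if (!r->leave) {                        /* second stage: stay */
            printf("turn %s, stay on highway\n", r->right ? "right" : "left");
            return 0;   /* a real route would carry another segment here */
        }
        printf("leave highway to local corner %u\n", r->corner);
        return 0;                               /* delivered */
    }

    int main(void)
    {
        struct route r = { 3, 1, 0, 2 };        /* 3 crossroads, then exit */
        while (hop(&r))
            ;
        return 0;
    }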

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/