From: "Andy "Krazy" Glew" on
Bernd Paysan wrote:
> I think the obvious thing was mentioned in the paper: "make the routers
> as simple as possible". This means that the routing information
> contains a physical route (a sequence of turns - the most simple router
> is a butterfly router with two inputs, two outputs, and one bit of
> routing information), and the router just passes on packets as it knows
> the next hop from the first beat of the message. On collisions, it has
> only a few options:

Our (Terje's) touching on fat-trees has me thinking about a different routing primitive: 2 pairs of bidirectional links,
1 pair "up" (coming from below) and 1 pair "down" (coming from above). Here "bidirectional" doesn't mean a single link
that can be turned around, but rather a pair of unidirectional links.

The fully populated fat tree property means that the "up" (coming from below) traffic can never be blocked. You have
two links coming from below, and two going up. The connectedness property of a fat tree means that you never get into a
dead end. You only have to choose which link you want to use.

You can get into collisions with downward-directed traffic. Such collisions can occur (a) between traffic coming from
both above links that is directed below, (b) with local traffic coming from below link #1 that wants to go to
below link #2, or (c) both.

You can buffer.

You can have fixed prioritization rules: e.g. local traffic (b) wins, requiring that the downward traffic coming from
the above links be either buffered or resent. Or the opposite: one of the downward streams wins.

If I assume that buffer-less nodes are cheaper than buffer-full nodes, we want a fabric that has many of these
buffer-less nodes, with either of the collision rules above, and a smaller number of buffer-full nodes, perhaps
only at the root layer of the fat-tree, but probably at multiple levels.

Note that prioritizing locality is a nice simple policy. Not the only possible policy, but a simple policy.
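
To make the "locality wins" rule concrete, here is a rough sketch of the per-output arbitration such a bufferless node
could do each cycle. All the type and function names are invented for illustration; this isn't any particular router's
design.

#include <stdbool.h>

typedef enum { FROM_DOWN0, FROM_DOWN1, FROM_UP0, FROM_UP1, NONE } port_t;

typedef struct {
    bool valid;      /* a flit is present on this input this cycle   */
    int  dest_down;  /* which down output it wants: 0, 1, or -1 (up) */
} flit_t;

/* Decide who gets down output 'out' this cycle.  Local traffic
 * (arriving on one down input and turning around to the other down
 * output) always wins; a loser from above must be buffered or resent
 * further up the tree. */
port_t arbitrate_down(int out, const flit_t in[4])
{
    port_t local = (out == 0) ? FROM_DOWN1 : FROM_DOWN0;

    if (in[local].valid && in[local].dest_down == out)
        return local;                      /* rule (b): locality wins      */
    if (in[FROM_UP0].valid && in[FROM_UP0].dest_down == out)
        return FROM_UP0;                   /* rule (a): traffic from above */
    if (in[FROM_UP1].valid && in[FROM_UP1].dest_down == out)
        return FROM_UP1;
    return NONE;                           /* nobody wants this output     */
}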

I'm trying to imagine policies that temporarily prioritize downward traffic coming from above, so that blocked local
traffic propagates upwards along the guaranteed-to-be-present fat-tree links until it comes to a buffer layer. I think
this has unit-gain (<= 1) positive feedback, so that it is stable - you might get into a mode where all of your traffic
gets sent to the buffering layer, but you wouldn't get worse and worse, the sort of replay tornadoes that some of us are
familiar with. You would require some sort of global scheduling so that downward traffic collisions did not create an
unstable system.

Because of the possibility of getting stuck in a limit cycle of poor performance (everything going to the buffering
layers, e.g. memory with no benefit of locality), you might want to "batch" actions - give local traffic priority while
establishing a link for downward traffic.
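
As a sketch of that batching idea (the names and window lengths below are made up, just to show the shape of the
policy): run in a local-priority mode most of the time, and only flip to downward priority for a bounded window when
traffic from above has been blocked.

#include <stdbool.h>

enum mode { LOCAL_PRIORITY, DOWNWARD_PRIORITY };

struct batch_sched {
    enum mode mode;
    int       cycles_left;
};

/* Called once per cycle.  Most of the time locality wins; when traffic
 * from above is still blocked at the end of a local window, flip to
 * downward priority just long enough to establish one downward link. */
void batch_tick(struct batch_sched *s, bool downward_blocked)
{
    if (--s->cycles_left > 0)
        return;
    if (s->mode == LOCAL_PRIORITY && downward_blocked) {
        s->mode = DOWNWARD_PRIORITY;  /* let blocked traffic from above through */
        s->cycles_left = 16;          /* assumed link-setup window              */
    } else {
        s->mode = LOCAL_PRIORITY;     /* default: prioritize locality           */
        s->cycles_left = 256;         /* assumed local batch length             */
    }
}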

I am sure that this has already been thought of.

---

Q: is a 2-input, 2-output 2x2 router really the best primitive?

Why not something that has 3 sets of links:

From/To Down Link #1: 1 in, 1 out
From/To Down Link #2: 1 in, 1 out
From/To Up Link:      1 in, 3 out

This has sufficient uplinks that we can always route in the face of collisions, never buffer or drop/retry.

Above I described:

From/To Down Link #1: 1 in, 1 out
From/To Down Link #2: 1 in, 1 out
From/To Up Link:      2 in, 2 out


although this doesn't have the always-routable-on-collision property.

But

From/To Down Link #1: 1 in, 1 out
From/To Down Link #2: 1 in, 1 out
From/To Up Link:      2 in, 4 out

would have the always-routable-on-collision property.
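
A quick back-of-the-envelope check of those three port mixes, assuming the worst case is every input flit wanting the
same down output: one of them wins it, and each loser must have an up output to escape to, since deflecting it down the
wrong subtree isn't an option. The struct and names below are just illustrative.

#include <stdio.h>

struct node_ports {
    const char *name;
    int down_in, up_in;   /* inputs  */
    int down_out, up_out; /* outputs */
};

/* true if, in the worst-case collision, every losing input still has
 * an up output to escape to */
static int always_routable(const struct node_ports *p)
{
    int losers = p->down_in + p->up_in - 1;
    return p->up_out >= losers;
}

int main(void)
{
    struct node_ports cfgs[] = {
        { "2 down + 1-in/3-out up", 2, 1, 2, 3 },
        { "2 down + 2-in/2-out up", 2, 2, 2, 2 },
        { "2 down + 2-in/4-out up", 2, 2, 2, 4 },
    };
    for (int i = 0; i < 3; i++)
        printf("%s: %s\n", cfgs[i].name,
               always_routable(&cfgs[i]) ? "always routable" : "may block");
    return 0;
}
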
From: Del Cecchi on

"Anne & Lynn Wheeler" <lynn(a)garlic.com> wrote in message
news:m3mxy1h9wc.fsf(a)garlic.com...
>
> Del Cecchi` <delcecchi(a)gmail.com> writes:
>> The original motivation was to do molecular simulations in the
>> bio-tech field, hence the name. Sure, IBM seized on the desire of the
>> National Labs for prestige and bomb simulation and used it to make a
>> profit.
>
> or seized on national labs (& numerical intensive) only as possible
> walling off move into commercial at the same time.
>
> old email
> http://www.garlic.com/~lynn/lhwemail.html#medusa
>
> old post about jan92 moving into commercial also
> http://www.garlic.com/~lynn/95.html#13
>
> a few weeks before being told it was transferred and couldn't work on
> anything with more than four processors.
>
> old email, a couple days/hrs ... before the hammer fell
> http://www.garlic.com/~lynn/2006x.html#email920129
>
> discussing the national lab scenario (I had to skip a LLNL meeting
> because of other commitments ... but some of the people at the
> meeting dropped by afterwards to bring me up to date).
>
> then the press item shortly after the hammer fell (17feb92)
> http://www.garlic.com/~lynn/2001n.html#6000clusters1
>
> and another press item later that summer (we were both gone within a
> few weeks):
> http://www.garlic.com/~lynn/2001n.html#6000clusters2
>
> the kingston engineering & scientific had been doing molecular
> simulation with numerous Floating Point Systems boxes tied to 3090
> with vector facility.
>
> In 1980, I had done some HYPERChannel work to allow overflow in the
> Santa Teresa lab. (300 people from IMS group) to be moved to offsite
> bldg ... but getting local interactive performance using HYPERChannel
> as mainframe channel extension. Then basically did the same
> installation for large IMS field support group in boulder. recent
> reference
> http://www.garlic.com/~lynn/2010f.html#17
>
> The person that I worked with for the Boulder installation then moved
> to Kingston to manage the Kingston E&S operation. I worked with him
> there to do high-speed HYPERChannel satellite link between Kingston
> E&S lab and the west coast. This was somewhat totally unrelated to the
> operation that was supposedly designing their own numerical intensive
> supercomputer and also providing funding for Steve Chen's effort.
> recent post with a little more of the gory details:
> http://www.garlic.com/~lynn/2010b.html#71 Happy DEC-10 Day
>
> The above tended to have some LLNL ties, in part because early backing
> for FCS was standards moving to fiber-optics ... something that LLNL
> had installed in serial-copper form.
>
> The SCI stuff was with Gustafson out of SLAC.
>
> Later one of the sparc-10 engineers was at another chip-shop and
> designing a fast/inexpensive SCI subset ... and tried to interest me
> into taking over the SUN SPRING/DOE operating system effort and
> adapting it to a large distributed SCI infrastructure. This was about
> the time SUN was shutting down the SPRING/DOE effort and transferring
> everybody over to Java group.
>
I was on the SCI committee, although I sort of came in in the middle.
And Rochester had an effort to use SCIL (SCI Like) interface to couple
AS400 Boxes in something we called "firmly coupled". The software
guys had even signed up. But some guy from POK (Baum?) put the kibosh
on it since he didn't believe the Rochester guys could make OS400 NUMA
when the Z folks said it would take hundreds of PY. POK always had a
NIH complex.

But in the end the SCI knock-off ended up in the xSeries NUMA box.

The topology was dual counter rotating rings.

del



From: Bernd Paysan on
MitchAlsup wrote:
> The other option is the virtual router where the first beat of
> information is the virtual route, and the router takes a beat to
> look up the physical port associated with the requests at hand. Adds one
> clock of delay from pin to pin, solves a lot of problems.

Yes, but it only works well when the lookup table is small (IMHO on the
order of 8-12 address bits, since virtual routing for an n-bit path is
O(exp(n))). So if you have virtual routes only for sufficiently small
subnets, you can route around a lot of problems inside these subnets
without propagating them to a global routing table.

Virtual routing is a good way to improve source routing: subdivide the
network into islands which are internally source-routed, and use virtual
routing on the boundaries to hide details you would otherwise have to
communicate to too many other hosts.
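
A minimal sketch of what that per-hop lookup might look like, assuming
(say) a 10-bit virtual route id so the per-router table stays at 1K
entries; the names and field widths are my assumptions, not any
particular design.

#include <stdint.h>

#define VR_BITS    10
#define VR_ENTRIES (1u << VR_BITS)

struct vr_entry {
    uint8_t  out_port;    /* physical port to forward on       */
    uint16_t next_vr;     /* virtual route id for the next hop */
};

struct router {
    struct vr_entry table[VR_ENTRIES];   /* grows as O(2^VR_BITS) */
};

/* one-beat lookup: rewrite the header's route id and pick the port */
uint8_t route_first_beat(const struct router *r, uint16_t *vr_id)
{
    const struct vr_entry *e = &r->table[*vr_id & (VR_ENTRIES - 1)];
    *vr_id = e->next_vr;
    return e->out_port;
}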

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Anne & Lynn Wheeler on

"Del Cecchi" <delcecchi(a)gmail.com> writes:
> And Rochester had an effort to use SCIL (SCI Like) interface to couple
> AS400 Boxes in something we called "firmly coupled". The software
> guys had even signed up. But some guy from POK (Baum?) put the kibosh
> on it since he didn't believe the Rochester guys could make OS400 NUMA
> when the Z folks said it would take hundreds of PY. POK always had a
> NIH complex.
>
> But in the end the SCI knock off ended up in Xseries NUMA box.
>
> The topology was dual counter rotating rings.

re:
http://www.garlic.com/~lynn/2010f.html#47 Nonlinear systems and nonlocal supercomputing

by the time of SCIL ... we were gone from IBM ... and I was only
intermittently involved with SCI (couldn't do a whole lot of
self-funding on standards committees).

long ago and far away, baum was hired into pok to be in charge of
(mainframe) tightly-coupled shared-memory multiprocessor architecture
.... at the same time my wife was con'ed into moving from the JES group
in G'burg to POK to be in charge of (mainframe) loosely-coupled (aka
cluster) architecture .... and for a time, both reported to the same
manager. mainframe shared-memory for a long time required much stronger
memory consistency ... than that provided in NUMA.

during her stint in POK, there was almost exclusive focus on
tightly-coupled ... and she didn't stay very long there. Her
loosely-coupled architecture (peer-coupled shared-data) saw very little
(mainframe) uptake, except for IMS hot-standby ... until sysplex.

much later, Steve Chen was CTO at sequent and they were doing NUMA-Q
(SCI) and we did some consulting for Steve. later IBM buys sequent. a
few recent references:
http://www.garlic.com/~lynn/2010e.html#68 Entry point for a Mainframe?
http://www.garlic.com/~lynn/2010e.html#70 Entry point for a Mainframe?
http://www.garlic.com/~lynn/2010f.html#7 What was the historical price of a P/390?
http://www.garlic.com/~lynn/2010f.html#13 What was the historical price of a P/390?

there is a similar joke about the internal network. there was somebody
from corporate hdqtrs in armonk who had participated in an SNA
investigation of what would be required to implement a world-wide
distributed network ... that came up with enormous amounts of PY ... in
part because SNA is so fundamentally opposite to a real distributed
network. It turns out the majority of the internal network was done by a
single person ... but it used a totally different approach that made a
world-wide distributed network a relatively trivial result. In any case,
the armonk expert stated that the internal network could not exist
because the corporation had never provided funding for such an enormous
amount of PY for networking.

totally unrelated recent reference to dual counter-rotating rings from
long ago and far away:
http://www.garlic.com/~lynn/2010e.html#69 search engine history, was Happy DEC-10 Day

aka a 1mbit/sec LAN being done to replace copper wiring harness bundles
in autos.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
From: Anne & Lynn Wheeler on

re:
http://www.garlic.com/~lynn/2010f.html#47 Nonlinear systems and nonlocal supercomputing
http://www.garlic.com/~lynn/2010f.html#48 Nonlinear systems and nonlocal supercomputing

in fact, one of the reasons for doing (rios) cluster scaleup ... was
that, at the time, there was no cache consistency support to allow doing
anything at all with SCI (the only scaleup was cluster). the engineering
manager that we reported to (when starting cluster scaleup) ... had only
recently moved over to head up the new somerset organization (motorola,
ibm, apple, etc) ... which would do a single-chip 801/risc and
eventually produce something that had any kind of cache-consistency
primitives for any kind of shared memory operations. but by the time any
kind of cache consistency support existed, we were long gone.

he does later show up as president of mips for a stint ... and we do
some stuff.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970