From: already5chosen on
On Apr 10, 7:13 pm, matt.rei...(a)sicortex.com wrote:
> On Apr 9, 12:51 pm, already5cho...(a)yahoo.com wrote:
>
>
>
> > Thanks for information, Matt.
> > I vaguely remember from the Byte articles from the mid 90s that 5Kf
> > FPU was optimized toward single-precision performance. Did you change
> > this part of the core?
>
> We did goose the FP unit a bit. We rebuilt the FP pipeline to support
> double precision at 2 FLOPs per cycle (MADD.D, a double-precision
> multiply-add), so the double precision FP rate is the same as the single
> precision FP rate.
>

Very well


> Other than that, we added cache coherence to the L1, and full
> single bit correction/double bit detect to the L1 Dcache. (The I
> cache is parity protected.)
>

For massively-parallel scientific workloads ECC (if it is actually
ECC) on L1D cache sounds to me like over-engineering. But what do I know?

> There were Byte articles in the mid 90's on the 5Kf? Who knew?

Power of Internet: http://futuretech.blinkenlights.nl/byte.html
From: Nick Maclaren on

In article <fe2aecdf-9c0c-4289-b25b-403bcbdf69b0(a)v26g2000prm.googlegroups.com>,
already5chosen(a)yahoo.com writes:
|> On Apr 10, 7:13 pm, matt.rei...(a)sicortex.com wrote:
|>
|> > Other than that, we added cache coherence to the L1, and full
|> > single bit correction/double bit detect to the L1 Dcache. (The I
|> > cache is parity protected.)
|>
|> For massively-parallel scientific workloads ECC (if it is actually
|> ECC) on L1D cache sounds to me like over-engineering. But what do I know?

Why? Mere single-bit detection on very large amounts of cache isn't
nice, because it forces a policy of almost panicky replacement. If
errors were completely independent, that wouldn't be necessary, but
they aren't.

Also, ECC gives you the option of delaying the write-through if the
memory / higher-level cache channel is busy, where parity doesn't.

I agree that it isn't critical, but it means that some other problems
can be avoided, and ECC technology is very well understood :-)


Regards,
Nick Maclaren.
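
To make the parity vs. ECC distinction in this exchange concrete, here is a minimal sketch of a single-error-correct/double-error-detect (SECDED) code: an extended Hamming (8,4) code over 4 data bits. This is a textbook construction, not the actual SiCortex L1 Dcache code (real caches typically protect wider words); the function names are made up for illustration. A plain parity bit would only report that some single bit flipped, while this code also says which one.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8
    # Data bits sit in the positions that are not powers of two.
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each Hamming check bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit, making the total parity even.
    code[0] = sum(code[1:]) % 2
    return code


def secded_decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the indices of all set bits; 0 for a valid word.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_overall = sum(code) % 2 == 1
    if odd_overall:
        # One bit flipped: the syndrome names it (0 means the overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but a nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = secded_encode(0b1011)
word[6] ^= 1                      # simulate a single-bit upset
print(secded_decode(word))
word[2] ^= 1                      # a second upset in the same word
print(secded_decode(word))
```

The second flip is detected but cannot be repaired, which is why the double-bit upset rate matters when deciding whether SECDED is enough.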
From: matt.reilly on
re: ECC on L1D

Nick is right -- ECC is well understood.

We go back and forth on this all the time. Here's the analysis that
always leads me to ECC:

1. Assume a single-bit-upset rate of 4000 failures per
megabit-billion-hours. (That's a reasonably conservative rule of thumb
for a static RAM array at 7000ft elevation. There are few reliable
numbers with good pedigree, and this isn't one of them. But you have to
start with some model. Actual reported numbers are all over the map.)

2. Assume the double-bit upset rate is far, far lower. (If you don't,
then you need to look at some other considerations.)

3. For SiCortex: 5832 processors * 32KB * 8 bit/B * 4000 / (1Mbit *
10^9 hrs) gives about 6 failures somewhere in the system every 1000
hours.
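
The arithmetic in step 3 can be checked with a short sketch. It assumes the rate is quoted per megabit per 10^9 hours (the usual FIT convention), that 1 Mbit means 2^20 bits, and that upsets are independent with a constant rate; the function and parameter names are illustrative only.

```python
def expected_upsets(processors, cache_kib, fit_per_mbit, hours):
    """Expected single-bit upsets over all L1 data caches in `hours` hours.

    fit_per_mbit: failures per megabit per 10**9 hours (assumed convention).
    """
    megabits = processors * cache_kib * 1024 * 8 / 2**20
    failures_per_hour = megabits * fit_per_mbit / 1e9
    return failures_per_hour * hours


# Whole SC5832 system versus a single processor, per 1000 hours.
system = expected_upsets(5832, 32, 4000, 1000)
single = expected_upsets(1, 32, 4000, 1000)
print(f"system: {system:.3f} upsets per 1000 hours")
print(f"one processor: {single:.4f} upsets per 1000 hours")
```

Under these assumptions the system-wide figure comes out a little under 6 per 1000 hours, while a single processor sees about one upset per million hours, which is the contrast the post is drawing.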

And that is the difference between designing with thousands of
processors in mind and designing for a single desktop or a small
cluster. (But note that even those systems put ECC on the L1 caches
now.)

An alternative is to build a write-through L1 and many good designs
take this route. At SiCortex we decided to put ECC in the L1 and keep
the L1 to L2 write path clean. (Write through caches either require
write aggregating -- a wonderland of interesting memory ordering issues
that must be addressed -- or partial word writes into the L2 --
generally extra hair that we chose to avoid.) BG/L chose to do a
write-through L1, as I recall. To each his own...


The SiCortex system design paid careful attention to reliability and
error recovery. Wires aren't perfect. Bad things sometimes happen to
good bits. Transients dominate.

And so, the SC5832, with over 26K diff pairs in its fabric, does
automatic error detection and retry at the link level. (And we
guarantee in-order message delivery, even in the presence of transient
errors. Permanent faults are another, and thankfully much, much rarer,
matter.)

We "over engineer" the thermal aspects so each node chip runs
relatively cool.

Every RAM array that contains "unique" data (all RAM arrays other than
the ICache) is protected by ECC.

And when things go wrong, the system can configure around sick nodes
until a service intervention can be scheduled.


Apologies if this sounded like a commercial. That wasn't my intent.
The intent is to get across the idea that large N multiprocessors like
BG/L and SiCortex and Cray XT3/XT4 systems only work well when the
designers get their heads around the idea that every error source gets
multiplied by three orders of magnitude or more.

When you are designing a quad socket server pizza box, you could
reasonably choose to ignore hardware error cases that happen every 16K
hours: chances are the user will blame the software anyway. There are
responsible, professional, admirable designers who make this choice
every day. Not everybody is willing to pay for hardware reliability.

But when the component is designed to be part of an ensemble of
thousands of components, the reliability calculation changes. That two
year MTTF turns into a one day MTTF, and no amount of bad software can
absorb all the blame. ;)
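
The "two year MTTF turns into a one day MTTF" scaling can be sketched as follows. This assumes independent components with a constant (exponential) failure rate, reads "16K hours" as 16,000 hours, and uses the 972-node system size as the ensemble; the helper names are hypothetical.

```python
import math


def ensemble_mttf(component_mttf_hours, n_components):
    """MTTF of the whole ensemble when any single component failure counts."""
    return component_mttf_hours / n_components


def prob_failure_during(job_hours, component_mttf_hours, n_components):
    """Probability that at least one component fails during a job."""
    rate = n_components / component_mttf_hours   # ensemble failures per hour
    return 1.0 - math.exp(-rate * job_hours)


component_mttf = 16_000   # hours, roughly two years
for n in (1, 972):
    mttf = ensemble_mttf(component_mttf, n)
    p = prob_failure_during(24, component_mttf, n)
    print(f"n={n}: ensemble MTTF {mttf:.1f} h, P(failure in a 24 h job) = {p:.3f}")
```

With one component the chance of hitting the error during a day-long job is well under one percent; with about a thousand of them it becomes the likely outcome.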


From: Del Cecchi on

<matt.reilly(a)sicortex.com> wrote in message
news:13192c58-df20-464e-bd13-a8d116cd96dc(a)8g2000hsu.googlegroups.com...
> On Apr 9, 7:49 pm, "Del Cecchi" <delcecchioftheno...(a)gmail.com> wrote:
>> This sounds a lot like a Blue Gene, only of course with Mips taking
>> the place of PowerPC as the processor. Would you comment on the
>> differences?
>>
>> del
>
> It doesn't sound anything like a Blue Gene -- it is much, much
> quieter. ;)
>
> The major differences relative to BG/L are
>
> 1. Higher performance inter-node communication (higher BW, lower
> end-to-end latency; average under 2 us MPI ping-pong).
>
> 2. Design centered on 972 nodes and smaller. Our ambitions are
> to fill needs in day-to-day production environments, not to beat
> the Earth Simulator or occupy slots in the Top500. (Somebody
> needs to do that, but we chose to focus elsewhere.)
>
> 3. Full linux/posix environment on all processors. All system software
> and SiCortex libraries are open source.
>
> 4. Up to 8GB of DRAM per 6 processor node.
>
> 5. SiCortex has configurations from 72 processors (the deskside
> development system) to 5832 (the cabinet with the gull wing doors).
> In addition to the SC648 (648 processors) and the SC1458 (1458
> processors, how DID we come up with this naming scheme?) there are
> incremental versions in between that involve replacing processor
> modules with "placeholder modules."
>
> 6. Kautz graph topology for all traffic vs. mesh/torus and trees. The
> graph diameter is 6 for the largest SiCortex system.
>
> 7. "Generic" IO as long as you think of PCI Express Modules as
> "Generic." Specifically, all systems support GigE, InfiniBand, and
> Fibre Channel. We support others, but I don't have the supported IO
> list in front of me right now.
>
> 8. BG/L does a better job of managing the processor-memory path:
> BG/L stream triads are 6x better than SiCortex. Sigh.
>
>
>
> BG/P will probably improve on a few of these, but I haven't seen
> results from the BG/P installations yet.
>
> There are probably other differences, but I'm more versed on the
> SiCortex side of things than the BG/L.

Thanks. Now I have to go bone up on "Kautz graph topology".

And yes, I have heard BG/P network is better. Not surprising since time
has passed....

del
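
For anyone else boning up on the Kautz graph topology, here is a minimal sketch that builds one and measures its diameter by breadth-first search. It uses the standard definition (nodes are fixed-length strings over degree+1 symbols with no two adjacent symbols equal; each edge shifts in one new symbol). The degree of 3 and string length of 6 are assumptions chosen to reproduce the 972 nodes and diameter 6 quoted in the thread, not figures taken from it.

```python
from collections import deque
from itertools import product


def kautz_graph(degree, length):
    """Map each node to its successors in the Kautz graph."""
    alphabet = range(degree + 1)
    nodes = [w for w in product(alphabet, repeat=length)
             if all(a != b for a, b in zip(w, w[1:]))]
    # An edge drops the first symbol and appends any symbol differing from the last.
    return {w: [w[1:] + (c,) for c in alphabet if c != w[-1]] for w in nodes}


def eccentricity(edges, start):
    """Longest shortest-path distance (in hops) from `start` to any node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in edges[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(edges):
    """Worst-case hop count over all source nodes."""
    return max(eccentricity(edges, s) for s in edges)


graph = kautz_graph(degree=3, length=6)
print(len(graph), "nodes, diameter", diameter(graph))
```

With these parameters the sketch should report 972 nodes and a diameter of 6, with only three outgoing links per node. That combination of small degree and small diameter is the usual argument for Kautz graphs over a mesh or torus.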


From: Paul A. Clayton on
On Apr 10, 1:37 pm, matt.rei...(a)sicortex.com wrote:
> On Apr 9, 7:49 pm, "Del Cecchi" <delcecchioftheno...(a)gmail.com> wrote:
>
> > This sounds a lot like a Blue Gene, only of course with Mips taking the
> > place of PowerPC as the processor. Would you comment on the differences?
>
> > del
>
> It doesn't sound anything like a Blue Gene -- it is much, much
> quieter. ;)
>
> The major differences relative to BG/L are

I would also consider the 2-wide FPU of Blue Gene processor a
significant difference (4 FLOPs per cycle vs. 2 FLOPs per cycle
for the SiCortex processor). I was a bit disappointed that the
SiCortex did not exploit such SIMD. Perhaps the targeted
workloads are not as computationally dense (i.e., the system
would be unbalanced relative to memory bandwidth or other
resources)? (Two element 'vectorizability' is common, isn't it??)


Paul A. Clayton
just a technophile
reachable as 'paaronclayton'
at "embarqmail.com"