From: already5chosen on 10 Apr 2008 14:10

On Apr 10, 7:13 pm, matt.rei...(a)sicortex.com wrote:
> On Apr 9, 12:51 pm, already5cho...(a)yahoo.com wrote:
>
> > Thanks for information, Matt.
> > I vaguely remember from the Byte articles from the mid 90s that 5Kf
> > FPU was optimized toward single-precision performance. Did you change
> > this part of the core?
>
> We did goose the FP unit a bit. We rebuilt the FP pipeline to support
> double precision at 2 FLOPs per cycle (MADD.D, a double-precision
> multiply-add), so the double-precision FP rate is the same as the
> single-precision FP rate.

Very well.

> Other than that, we added cache coherence to the L1, and full
> single bit correction/double bit detect to the L1 Dcache. (The I
> cache is parity protected.)

For massively-parallel scientific workloads ECC (if it is actually ECC) on L1D cache sounds to me like over-engineering. But what I know?

> There were Byte articles in the mid 90's on the 5Kf? Who knew?

Power of Internet: http://futuretech.blinkenlights.nl/byte.html

From: Nick Maclaren on 10 Apr 2008 14:56

In article <fe2aecdf-9c0c-4289-b25b-403bcbdf69b0(a)v26g2000prm.googlegroups.com>,
already5chosen(a)yahoo.com writes:
|> On Apr 10, 7:13 pm, matt.rei...(a)sicortex.com wrote:
|>
|> > Other than that, we added cache coherence to the L1, and full
|> > single bit correction/double bit detect to the L1 Dcache. (The I
|> > cache is parity protected.)
|>
|> For massively-parallel scientific workloads ECC (if it is actually
|> ECC) on L1D cache sounds to me like over-engineering. But what I know?

Why? Mere single-bit detection on very large amounts of cache isn't nice, because it forces a policy of almost panicky replacement. If errors were completely independent, that wouldn't be necessary, but they aren't.

Also, ECC gives you the option of delaying the write-through if the memory / higher-level cache channel is busy, where parity doesn't.

I agree that it isn't critical, but it means that some other problems can be avoided, and ECC technology is very well understood :-)

Regards,
Nick Maclaren.

From: matt.reilly on 10 Apr 2008 17:07

re: ECC on L1D

Nick is right -- ECC is well understood. We go back and forth on this all the time. Here's the analysis that always leads me to ECC:

1. Assume a single-bit-upset rate of 4000 failures per megabit-billion-hours. (That's a reasonably conservative rule of thumb for a static RAM array at 7000ft elevation. There are few reliable numbers with good pedigree, and this isn't one of them. But you have to start with some model. Actual reported numbers are all over the map.)

2. Assume the double-bit upset rate is far far lower. (If you don't, then you need to look at some other considerations.)

3. For SiCortex: 5832 processors * 32KB * 8 bit/B * 4000 / (1Mbit * 1e9 hrs) gives 6 failures somewhere in the system every 1000 hours.

And that is the difference between designing with thousands of processors in mind and designing for a single desktop or a small cluster. (But note that even those systems put ECC on the L1 caches now.)

An alternative is to build a write-through L1, and many good designs take this route. At SiCortex we decided to put ECC in the L1 and keep the L1 to L2 write path clean. (Write-through caches either require write aggregating -- a wonderland of interesting memory ordering issues that must be addressed -- or partial word writes into the L2 -- generally extra hair that we chose to avoid.) BG/L chose to do a write-through L1, as I recall. To each his own...

The SiCortex system design paid careful attention to reliability and error recovery. Wires aren't perfect. Bad things sometimes happen to good bits. Transients dominate. And so, the SC5832, with over 26K diff pairs in its fabric, does automatic error detection and retry at the link level. (And we guarantee in-order message delivery, even in the presence of transient errors. Permanent faults are another (and thankfully, much much rarer) matter.) We "over engineer" the thermal aspects so each node chip runs relatively cool. Every RAM array that contains "unique" data (all RAM arrays other than the ICache) is protected by ECC. And when things go wrong, the system can configure around sick nodes until a service intervention can be scheduled.

Apologies if this sounded like a commercial. That wasn't my intent. The intent is to get across the idea that large N multiprocessors like BG/L and SiCortex and Cray XT3/XT4 systems only work well when the designers get their heads around the idea that every error source gets multiplied by three orders of magnitude or more.

When you are designing a quad socket server pizza box, you could reasonably choose to ignore hardware error cases that happen every 16K hours: chances are the user will blame the software anyway. There are responsible, professional, admirable designers who make this choice every day. Not everybody is willing to pay for hardware reliability.

But when the component is designed to be part of an ensemble of thousands of components, the reliability calculation changes. That two year MTTF turns into a one day MTTF, and no amount of bad software can absorb all the blame. ;)

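The arithmetic in step 3 and the MTTF scaling at the end of the post can be checked with a few lines of Python. This is only a sketch under stated assumptions: "billion hours" is taken as 1e9 hours, a megabit is taken as 2^20 bits, failures are treated as independent with a constant rate (so rates simply add across components), and the count of 972 components in the last line is an illustrative choice, not a figure given for this calculation. The function and constant names are made up for the example.

```python
MEGABIT = 2**20        # assumption: binary megabit (1e6 gives a similar answer)
BILLION_HOURS = 1e9    # the rate is quoted per megabit per billion hours


def upsets_per_hour(units, bytes_per_unit, failures_per_mbit_billion_hours):
    """Expected single-bit upsets per hour across all units' RAM arrays."""
    total_megabits = units * bytes_per_unit * 8 / MEGABIT
    return total_megabits * failures_per_mbit_billion_hours / BILLION_HOURS


def ensemble_mttf_hours(component_mttf_hours, components):
    """MTTF of a whole ensemble, assuming independent constant-rate failures."""
    return component_mttf_hours / components


# Step 3 of the post: 5832 processors, 32KB L1 Dcache each, 4000 failures
# per megabit-billion-hours.
l1d_rate = upsets_per_hour(5832, 32 * 1024, 4000)
print(f"L1D upsets per 1000 hours: {l1d_rate * 1000:.2f}")
print(f"Hours between upsets:      {1 / l1d_rate:.0f}")

# A fault seen every 16K hours on one part, multiplied across an ensemble.
# 972 is an illustrative component count (an assumption for this sketch).
print(f"Ensemble MTTF: {ensemble_mttf_hours(16_000, 972):.1f} hours")
```

Under these assumptions the L1 Dcache estimate comes out a little under 6 upsets per 1000 hours, in line with the figure quoted in the post, and a 16K-hour fault interval shrinks to well under a day once roughly a thousand parts share it.
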
From: Del Cecchi on 10 Apr 2008 19:05

<matt.reilly(a)sicortex.com> wrote in message
news:13192c58-df20-464e-bd13-a8d116cd96dc(a)8g2000hsu.googlegroups.com...
> On Apr 9, 7:49 pm, "Del Cecchi" <delcecchioftheno...(a)gmail.com> wrote:
>> This sounds a lot like a Blue Gene, only of course with Mips taking
>> the place of PowerPC as the processor. Would you comment on the
>> differences?
>>
>> del
>
> It doesn't sound anything like a Blue Gene -- it is much much
> quieter. ;)
>
> The major differences relative to BG/L are
>
> 1. Higher performance inter-node communication (higher BW, lower
> end-to-end latency (average under 2uS MPI ping-pong)).
>
> 2. Design centered on 972 nodes and smaller. Our ambitions are
> to fill needs in day-to-day production environments, not to beat
> the Earth Simulator or occupy slots in the Top500. (Somebody
> needs to do that, but we chose to focus elsewhere.)
>
> 3. Full linux/posix environment on all processors. All system software
> and SiCortex libraries are open source.
>
> 4. Up to 8GB of DRAM per 6 processor node.
>
> 5. SiCortex has configurations from 72 processors (the deskside
> development system) to 5832 (the cabinet with the gull wing doors).
> In addition to the SC648 (648 processors) and the SC1458 (1458
> processors, how DID we come up with this naming scheme?) there are
> incremental versions in between that involve replacing processor
> modules with "placeholder modules."
>
> 6. Kautz graph topology for all traffic vs. mesh/torus and trees. The
> graph diameter is 6 for the largest SiCortex system.
>
> 7. "Generic" IO as long as you think of PCI Express Modules as
> "Generic." Specifically, all systems support GigE, Infiniband, and
> Fibre Channel. We support others, but I don't have the supported IO
> list in front of me right now.
>
> 8. BG/L does a better job of managing the processor-memory path:
> BG/L stream triads are 6x better than SiCortex. Sigh.
>
> BG/P will probably improve on a few of these, but I haven't seen
> results from the BG/P installations yet.
>
> There are probably other differences, but I'm more versed on the
> SiCortex side of things than the BG/L.

Thanks. Now I have to go bone up on "Kautz graph topology".

And yes, I have heard the BG/P network is better. Not surprising since time has passed....

del

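For readers who also need to bone up on Kautz graphs: in the usual definition, the nodes are strings over an alphabet of d+1 symbols in which no two adjacent symbols are equal, and each node has an edge to every string obtained by dropping its first symbol and appending a new one. The sketch below builds such a graph and measures its diameter by breadth-first search. The degree of 3 and string length of 6 are assumptions chosen because they reproduce the 972 nodes and diameter of 6 mentioned in the quoted post; the post itself does not state the degree, and the function names are invented for this example.

```python
from collections import deque
from itertools import product


def kautz_nodes(degree, length):
    """All strings of `length` symbols over degree+1 symbols with no
    two adjacent symbols equal (the nodes of a Kautz graph)."""
    return [s for s in product(range(degree + 1), repeat=length)
            if all(a != b for a, b in zip(s, s[1:]))]


def kautz_successors(node, degree):
    """Shift left by one symbol and append any symbol that differs
    from the current last symbol: `degree` outgoing edges per node."""
    return [node[1:] + (c,) for c in range(degree + 1) if c != node[-1]]


def diameter(nodes, degree):
    """Longest shortest path over all ordered node pairs, found by BFS."""
    worst = 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in kautz_successors(cur, degree):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        if len(dist) != len(nodes):
            raise ValueError("graph is not strongly connected")
        worst = max(worst, max(dist.values()))
    return worst


# Assumed parameters: degree 3, string length 6.
nodes = kautz_nodes(3, 6)
print(len(nodes), "nodes, diameter", diameter(nodes, 3))
```

The appeal of the topology is visible in the numbers: with only three outgoing links per node, any node can reach any other in a number of hops that grows with the string length rather than with the node count.
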
From: Paul A. Clayton on 11 Apr 2008 17:19

On Apr 10, 1:37 pm, matt.rei...(a)sicortex.com wrote:
> On Apr 9, 7:49 pm, "Del Cecchi" <delcecchioftheno...(a)gmail.com> wrote:
>
> > This sounds a lot like a Blue Gene, only of course with Mips taking the
> > place of PowerPC as the processor. Would you comment on the differences?
>
> > del
>
> It doesn't sound anything like a Blue Gene -- it is much much
> quieter. ;)
>
> The major differences relative to BG/L are

I would also consider the 2-wide FPU of the Blue Gene processor a significant difference (4 FLOPs per cycle vs. 2 FLOPs per cycle for the SiCortex processor). I was a bit disappointed that the SiCortex did not exploit such SIMD. Perhaps the targeted workloads are not as computationally dense (i.e., the system would be unbalanced relative to memory bandwidth or other resources)? (Two element 'vectorizability' is common, isn't it??)

Paul A. Clayton
just a technophile
reachable as 'paaronclayton' at "embarqmail.com"