From: Terje Mathisen "terje.mathisen at on
Paul A. Clayton wrote:
> On Jul 28, 5:32 pm, MitchAlsup<MitchAl...(a)aol.com> wrote:
> [snip]
>> The wide memory bus is invariably faster, especially with a small
>> number of DIMMs.
>
> Wouldn't having twice as many potentially active DRAM banks (two
> independent channels vs. two DIMM channels merged to a single
> addressed channel) be a significant benefit for many multithreaded
> and some single-threaded applications where bank conflicts might be
> more common (especially with a "small number of DIMMs")?

It would, unless even the combined channel is still no wider than a
cache line:

When a single channel can deliver 64 bits and a combined channel 128
bits, and a cache line (at 512+ bits) is the smallest unit of transfer,
then you _want_ wider channels.
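
A minimal C sketch of that arithmetic (the 512-bit line and the 64- and
128-bit channel widths are the numbers above; the program just counts
data beats per line):

#include <stdio.h>

/* Count the data beats needed to move one cache line over a narrow
 * (single) vs. a combined (wide) channel.  Line and channel widths are
 * the ones discussed above. */
int main(void)
{
    const int line_bits = 512;              /* one cache line */
    const int widths[]  = { 64, 128 };      /* single vs. combined channel */

    for (int i = 0; i < 2; i++)
        printf("%3d-bit channel: %d beats per %d-bit line\n",
               widths[i], line_bits / widths[i], line_bits);
    return 0;
}

With the cache line as the smallest unit of transfer, halving the beat
count halves the time until the line is complete.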

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Benny Amorsen on
Terje Mathisen <"terje.mathisen at tmsw.no"> writes:

> Yes, but if you want to take a chance and skip the trailing checksum
> test, in order to forward packets as soon as you have the header, then
> you would have even more severe timing restrictions, right?

There are several layers of checksums at play here. If we stick to IP,
only the header has a checksum, and for IPv6 even that has been removed.
So there isn't really a chance to take, because you have the checksum
before you start receiving the payload (and the payload isn't
protected).
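
For concreteness, a minimal sketch of the standard RFC 1071
one's-complement checksum as used for the IPv4 header. It covers only
the header bytes, which is exactly why it can be checked before any of
the payload has arrived; the sample header below is made up for
illustration.

#include <stdint.h>
#include <stdio.h>

/* Standard RFC 1071 one's-complement checksum.  For IPv4 it covers only
 * the (typically 20-byte) header, not the payload. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {                      /* sum 16-bit words */
        sum += (uint32_t)(p[0] << 8 | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)                               /* odd trailing byte */
        sum += (uint32_t)(p[0] << 8);
    while (sum >> 16)                      /* fold the carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

int main(void)
{
    /* A made-up 20-byte IPv4 header, checksum field (bytes 10-11) zero. */
    uint8_t hdr[20] = {
        0x45, 0x00, 0x00, 0x54, 0x12, 0x34, 0x40, 0x00,
        0x40, 0x01, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
        0xc0, 0xa8, 0x00, 0x02
    };
    uint16_t ck = inet_checksum(hdr, sizeof hdr);

    hdr[10] = ck >> 8;                     /* store checksum big-endian */
    hdr[11] = ck & 0xff;

    /* Verifying over the full header, checksum included, must give 0. */
    printf("stored 0x%04x, verify -> 0x%04x\n",
           ck, inet_checksum(hdr, sizeof hdr));
    return 0;
}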

There is a whole-packet checksum at the ethernet level (if the physical
layer happens to be ethernet, of course). Switches used to pretty much
universally do cut-through switching until gigabit switches arrived.
Almost all gigabit switches are store-and-forward, but somehow latency
was rediscovered with 10 Gbps switches, so quite a few of those are
cut-through.
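
A back-of-the-envelope C sketch of the latency cost in question (the
1500-byte frame is the usual Ethernet MTU; the rates are simply 1 GbE
and 10 GbE):

#include <stdio.h>

/* Minimum added latency of store-and-forward: the whole frame must be
 * received (and its trailing CRC checked) before forwarding can start,
 * while cut-through needs only the header. */
int main(void)
{
    const double frame_bits  = 1500.0 * 8;
    const double rates_bps[] = { 1e9, 10e9 };

    for (int i = 0; i < 2; i++)
        printf("%2.0f Gbps: store-and-forward adds >= %.1f us per hop\n",
               rates_bps[i] / 1e9, 1e6 * frame_bits / rates_bps[i]);
    return 0;
}

At 10 Gbps that remaining microsecond or so per hop is presumably the
latency that got "rediscovered".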

Unfortunately "cut-through routing" refers to something entirely
different from "cut-through switching". I haven't been able to find any
products claiming to do anything but store-and-forward routing.


/Benny

From: MitchAlsup on
On Jul 28, 7:40 pm, "Paul A. Clayton" <paaronclay...(a)embarqmail.com>
wrote:
> On Jul 28, 5:32 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> [snip]
>
> > The wide memory bus is invariably faster, especially with a small
> > number of DIMMs.
>
> Wouldn't having twice as many potentially active DRAM banks (two
> independent channels vs. two DIMM channels merged to a single
> addressed channel) be a significant benefit for many multithreaded
> and some single-threaded applications where bank conflicts might be
> more common (especially with a "small number of DIMMs")?

This was extensively simulated, and to our surprise::

The first data beat arrives at the same point in time on both the wide
and narrow arrangements. The last data beat arrives a lot later on the
narrow arrangement, and it is the last data beat which governs the
sending of the line through the crossbar. This is not inherently
necessary, but it is how the crossbar in Opteron works. Once the
memory controller starts sending a line, it cannot switch to another
line and use the bandwidth available in the fabric router. So getting
the whole line out into the fabric quickly is the key, and that is why
the dual DIMM bus does not work as well as one would expect.
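
A small C sketch of the shape of that timing argument; the first-beat
latency and beat period below are placeholders, and only the 512-bit
line and the 64- vs. 128-bit widths come from the discussion above.

#include <stdio.h>

/* t_first is the (identical) arrival time of the first data beat; the
 * line can only be pushed through the Opteron crossbar once the *last*
 * beat has arrived. */
int main(void)
{
    const double t_first  = 50.0;           /* ns to first beat (placeholder) */
    const double t_beat   = 2.5;            /* ns per data beat (placeholder) */
    const int    line     = 512;            /* bits per cache line */
    const int    widths[] = { 64, 128 };    /* narrow vs. wide arrangement */

    for (int i = 0; i < 2; i++) {
        int beats = line / widths[i];
        printf("%3d-bit bus: last beat (and crossbar hand-off) at %.1f ns\n",
               widths[i], t_first + (beats - 1) * t_beat);
    }
    return 0;
}

The first-beat time is identical either way; only the last-beat time,
which is what the crossbar waits for, differs.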

It is perfectly reasonable to build a fabric where this property is
not a limiting factor, and actually get better performance with more
banks.

Mitch
From: Thomas Womack on
In article <5fb1774d-6056-4564-a6c8-4c9919a50cd7(a)j8g2000yqd.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:

>People who need to get the physics right know how to do it:
>
>http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf

Are http://code.google.com/p/p3dfft/ and
http://www.sdsc.edu/us/resources/p3dfft/docs/TG08_DNS.pdf relevant in
this case? Large 3D FFTs decomposed over lots of processors into
(N/x)*(N/y)*N bricks, just over 100 seconds for 8192^3 on 2^15 CPUs
(2048 quad-quad-Opterons) at TACC. The TACC machine 'Ranger' is a
load of racks plus a monolithic 3456-port 40 Gbit InfiniBand switch
from Sun, so it doesn't look that dissimilar to a national-labs machine.
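
For scale, a tiny C sketch of those bricks; N = 8192 and 2^15 ranks are
the numbers above, while the 128 x 256 split of the process grid is only
an assumed example (p3dfft lets you pick the 2D grid, and the paper's
actual choice may differ).

#include <stdio.h>

/* Pencil ("brick") decomposition: each rank owns an (N/p1) x (N/p2) x N
 * sub-volume of the N^3 grid. */
int main(void)
{
    const long N  = 8192;
    const long p1 = 128, p2 = 256;          /* p1 * p2 = 2^15 ranks */

    printf("each rank holds a %ld x %ld x %ld brick (%ld points)\n",
           N / p1, N / p2, N, (N / p1) * (N / p2) * N);
    return 0;
}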

I suppose the question is what counts as awful - it's 5% of peak, but
(figure 2 of TG08_DNS) it's 5% of peak from 2^9 to 2^15 CPUs, which
it's not ludicrous to call scalable.

Tom
From: Terje Mathisen "terje.mathisen at on
Thomas Womack wrote:
> In article<5fb1774d-6056-4564-a6c8-4c9919a50cd7(a)j8g2000yqd.googlegroups.com>,
> Robert Myers<rbmyersusa(a)gmail.com> wrote:
>
>> People who need to get the physics right know how to do it:
>>
>> http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf
>
> Are http://code.google.com/p/p3dfft/ and
> http://www.sdsc.edu/us/resources/p3dfft/docs/TG08_DNS.pdf relevant in
> this case? Large 3D FFTs decomposed over lots of processors into
> (N/x)*(N/y)*N bricks, just over 100 seconds for 8192^3 on 2^15 CPUs
> (2048 quad-quad-Opterons) at TACC. The TACC machine 'Ranger' is a
> load of racks plus a monolithic 3456-port 40 Gbit InfiniBand switch
> from Sun, so it doesn't look that dissimilar to a national-labs machine.
>
> I suppose the question is what counts as awful - it's 5% of peak, but
> (figure 2 of TG08_DNS) it's 5% of peak from 2^9 to 2^15 CPUs, which
> it's not ludicrous to call scalable.

No, not at all.

To me, the most interesting part of that paper was the way they had to
tune their 2D decomposition to the actual core layout, i.e. with 16
cores/node the most efficient setup used 4xN "pencils", almost
certainly because those 16 cores come from four 4-core CPUs.
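
A minimal C sketch of that placement argument; the row-major
rank-to-grid mapping and the rank-to-CPU assignment below are
assumptions for illustration, not p3dfft's or the MPI library's actual
mapping.

#include <stdio.h>

/* If the four ranks that share the short grid dimension are consecutive,
 * each such group of four lands on one 4-core CPU of a 16-core node. */
int main(void)
{
    const int p1 = 4;                        /* short dimension of the grid */
    for (int rank = 0; rank < 16; rank++)    /* one 16-core node */
        printf("rank %2d -> grid (%d,%d), CPU %d\n",
               rank, rank % p1, rank / p1, rank / p1);
    return 0;
}

With that layout the all-to-all along the short (4-wide) dimension stays
inside one CPU, which would explain why 4xN pencils win on such nodes.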

The other highly significant piece of information was that they got
away with single-precision numbers!

Using DP instead would double memory and communication sizes, while it
would reduce FP throughput by an order of magnitude on something like a
Cell or most GPUs. OTOH, this would still be fast enough to keep up with
the communication network, right?
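
Roughly quantifying the size argument, under some loudly flagged
assumptions: the 8192^3 grid, the 2048 nodes and the 40 Gbit link rate
come from the thread above, but treating one 40 Gbit/s link per node as
the available bandwidth for a transpose is purely illustrative.

#include <stdio.h>

/* An N^3 grid of real values in single vs. double precision, and a crude
 * estimate of the time to move the whole array once over the network. */
int main(void)
{
    const double n        = 8192.0;
    const double points   = n * n * n;
    const double link_bps = 40e9;            /* per-node link (assumption) */
    const double nodes    = 2048.0;

    for (int bytes = 4; bytes <= 8; bytes += 4)
        printf("%d-byte reals: %.1f TB total, ~%.1f s to move once\n",
               bytes, points * bytes / 1e12,
               points * bytes * 8 / (link_bps * nodes));
    return 0;
}

Doubling those numbers by going to DP is felt directly in communication
time, while the order-of-magnitude FP-throughput hit mostly matters on
hardware like Cell or GPUs where DP is much slower.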

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"