From: "Andy "Krazy" Glew" on
Although I don't want to get too involved in the discussion about US supercomputer policy, not my bailiwick, I will
continue to not purely lurk because I am interested in the question: "When is it worthwhile adding processors that will
be left idle much of the time?"

I'm interested in this not just in the context of this discussion, but in terms of past discussions:

e.g. we have discussed multithreaded architectures like Tera, and, for that matter, Intel HT (SMT), AMD MCMT, etc.

One of the basic ideas behind multithreading is to switch to a different thread when the supposedly expensive execution
units would otherwise go idle. Yet such threading has a hardware cost: the register files needed to support extra threads get
slower and more power hungry, you need buffers to hold threads that are switched out waiting for a cache miss, and so on.

At some point it is not worthwhile adding more threads, and their associated hardware. At some point it is probably
worth letting the processor go idle. I'm trying to get my head around what that point is, in a semi-theoretical sense,
to help calibrate my understanding if I get to work on such issues again.

E.g. with multicluster multithreading, as AMD is rumored to be doing, and as I envisaged it: add execution units that
you know will be more idle than in an SMT.

But the tradeoff that I want to consider is not the "stop adding more threading, we are in diminishing returns" area.
Instead I want to consider the tradeoff between

Adding another completely separate processor, with the associated interconnect overhead
vs
Adding more threading.

e.g. let's assume that you have a base processor with flops X, hardware cost H, connected to a network I.

If you add a duplicate core, you get flops 2X (best case), hardware cost 2H, and interconnect cost - well, not 2I.
Perhaps double in the local interconnect, but definitely not in the global. So let's split the interconnect cost as I = L + G, and say
the two-core configuration's interconnect costs 3L + G. (Note that I am making the assumption that you have a simplistic tree interconnect - to double
the processors, you add an extra layer of local stuff => 3L.)

If you double the multithreading, you get flops 2X, hardware cost H+M, and interconnect... well, let's just leave
interconnect unaffected, although realistically you will have to add more hardware. But let's assume that interconnect
can be completely pipelined and clocked faster, wave pipelined or whatever, and that doesn't cost you.

Subtracting the doubled-multithreading cost from the duplicate-core cost, C2 - M2 = (2H + 3L + G) - (H + M + L + G) = H + 2L - M.

Which means that you only stop adding threads when the multithreading overhead M - the bigger RFs, thread state, etc. - exceeds
the cost of a whole core plus the extra local interconnect, H + 2L. Which is a long way out. (Or when the ALUs are almost fully used, which
comes sooner.)
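
To make that concrete, here is a minimal back-of-the-envelope sketch in Python. All of the numbers (H, M, L, G) are
invented purely for illustration; only the algebra is taken from the argument above.

# Invented illustrative costs (arbitrary units); only the algebra matters.
H = 100.0   # base core cost
M = 15.0    # overhead of doubling the multithreading (bigger RFs, thread state)
L = 20.0    # local interconnect cost
G = 50.0    # global interconnect cost

cost_extra_core   = 2*H + 3*L + G   # duplicate core: extra tree layer => 3L
cost_extra_thread = H + M + L + G   # double threads: interconnect assumed unchanged

# C2 - M2 = H + 2L - M, as above
print(cost_extra_core - cost_extra_thread)   # 125.0 = 100 + 2*20 - 15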

However, the above is too simplistic: when you double cores, you don't get 2X the flops. And when you double threads,
you get even less of an increase. Perhaps it is better to compare in terms of total cost per flop:

(2H+3L+G) / X-multicore

vs

(H+M+L+G) / X-multithread

But this means that the incremental value, the utilization of the extra flops that you are adding via multithreading,
may be significantly less than the incremental value in multicore. The old economist's "Occasionally it is worth selling a
product below cost, so long as it covers your marginal cost" issue.

It also means that there is a sawtooth wave. If you want to increase performance by 1.5X, it may be worth adding
threads; but when you cross the threshold, you need to add more cores and back off on the threads. I.e. when you add
cores you want to back off a few generations on the threading.
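
A hedged extension of the same Python sketch, now dividing by delivered rather than peak flops. The utilization
figures - a second core delivering 90% of its peak, a second thread context only 60% - are pure assumptions chosen to
show the shape of the comparison, not measurements.

H, M, L, G, X = 100.0, 15.0, 20.0, 50.0, 1.0

# Assumed (not measured) marginal utilizations.
flops_two_cores   = X * (1 + 0.9)   # second core 90% effective
flops_two_threads = X * (1 + 0.6)   # second thread 60% effective

cost_per_flop_cores   = (2*H + 3*L + G) / flops_two_cores     # ~163 cost units per flop
cost_per_flop_threads = (H + M + L + G) / flops_two_threads   # ~116 cost units per flop

print(cost_per_flop_cores, cost_per_flop_threads)

Sweep the target speedup instead of fixing it at 2X and the sawtooth appears: threads are the cheap way to the next
performance increment until their marginal utilization falls below what a fresh core would deliver, at which point you
add the core and back off on the threading.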


Robert Myers wrote:
> On Mar 10, 2:16 am, Terje Mathisen <"terje.mathisen at tmsw.no">
> wrote:
>> Andy "Krazy" Glew wrote:
>>> Robert Myers wrote:
>>>> A machine that achieves 5% efficiency doing a bog-standard problem has
>>>> little right to claim to be super in any respect. That's my mantra.

>>> I've been lurking and listening and learning, but, hold on: this is silly.
>>> Utilization has nothing to do with "super-ness". Cost effectiveness is
>>> what matters.

> I focus on the FFT because there I am certain of my footing. In a
> broader sense that I can't defend with such specificity, if all your
> computer designs force localized computation or favor it heavily, you
> will skew the kind of science that you can and will do.

I'm quite happy to agree with you. If FFTs are the most important application, and if you are not building the sort of
machine that an FFT needs, in an ideal world you would change the machine design.

I'm just pointing out that the most cost effective FFT machine may not be the "ideal" FFT machine from the point of view
of processor utilization.

I also agree with you that there may be unfortunate side effects: if the most cost effective FFT machine is even more
cost effective for non-FFT calculations, then the science that gets done may drift in those non-FFT directions.



>>> But if the flops are cheap compared to the bandwidth, then it may very
>>> well make sense to add lots of flops. You might want to add more
>>> bandwidth, but if you can add a 1% utilized flop and eke a little bit
>>> more out of the expensive interconnect...

>> If having a teraflop available in a $200 chip makes it possible to get
>> 10% better use out of the $2000 (per board) interconnect, then that's a
>> good bargain.
>>
> Maybe it is and maybe it isn't.


> My argument is that the scalability of current machines is
> a PT Barnum hoax with very little scientific payoff.
>
> ... People may intuitively think that if you just pile
> up enough flops in one place, you can do any kind of math you want,
> but that intuition is dangerous.

I definitely agree with you.

> When I start down this road, I will inevitably start to sputter
> because I feel so outnumbered.

Join the club.


> If the limit on interconnects is fundamental, I'd sure like to understand
> why.

It's always fun to hear Burton Smith do his mental calculations of interconnect: "If I have such and such a volume of
copper, then the best bisection bandwidth that I can achieve is ..."
From: "Andy "Krazy" Glew" on
Robert Myers wrote:
> On Mar 10, 9:33 pm, "Del Cecchi" <delcec...(a)gmail.com> wrote:
>> As I said a few days ago, bandwidth costs money.
>> Latency is with us always.
>>
> You keep saying that, and it makes me grind my teeth every time you
> do. When Seymour said, "You can't fake it," he was talking about
> *bandwidth*, not latency. You *can* fake latency, and IBM did some of
> the most fundamental work in that area. You *cannot* fake bandwidth.

We fake latency via caches and prefetchers. Which work pretty well, but for some workloads don't work at all.

We fake bandwidth by replicating computation and by compressing data. Instead of doing the smallest number of computations, you
compress the data, send the smallest amount of data between nodes, and then use local computation to uncompress
and/or replicate computations.

Faking bandwidth doesn't seem to work quite as well.
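
As a toy illustration of the "fake bandwidth" direction (my own sketch, not anything specific from the thread): spend
local cycles compressing before a transfer and decompressing after it, so fewer bytes cross the interconnect.

import pickle, zlib

def bytes_on_wire_plain(data):
    # honest bandwidth: ship every byte
    return len(pickle.dumps(data))

def bytes_on_wire_compressed(data):
    # faked bandwidth: burn sender/receiver cycles on zlib instead
    return len(zlib.compress(pickle.dumps(data)))

payload = list(range(100_000))   # stand-in for a regular, compressible data set
print(bytes_on_wire_plain(payload), bytes_on_wire_compressed(payload))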

---

I'm thinking about this because it became obvious in my conversation with Ivan Sutherland that his head is totally in
bandwidth space. As you might expect from graphics.

Whereas I have spent most of my career in latency space.

This is causing me to wonder: are there any important computations that are still latency sensitive? Or is everything
bandwidth sensitive from now on?

I suspect latency still matters, but I want to understand how, in what workloads. Particularly, in what parallel
workloads does latency still matter.

By the way, though, the really latency sensitive supercomputer workloads tend not to be so incompatible
architecture-wise with bandwidth, because they tend not to benefit much from caches or from complex multistage routing
networks that emphasize local connectivity. They tend to want low latency access to global data.
It's not latency vs. bandwidth. It's locality vs. globality, and within the global part, maybe global latency vs. global bandwidth.

And here may be the crux of the problem: global latency and global bandwidth are not incompatible when the interconnect
is shallow. But when you start adding multiple stages to the interconnect, you tend to hurt global latency; but you
also provide a place that is just so very damned attractive to add a processor. Every switching node is a place you can
add a processor. And when you add processors in the switching nodes, you create architectures that favor locality.
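
A tiny sketch of why the stages hurt, under a simplification of my own (a plain radix-k tree, not any particular
machine's network): worst-case node-to-node hop count grows with every level you add as the machine scales up.

def tree_levels(nodes, radix):
    # number of switch levels needed for a radix-ary tree to reach `nodes` leaves
    levels, capacity = 0, 1
    while capacity < nodes:
        capacity *= radix
        levels += 1
    return levels

for n in (64, 1024, 16384):
    # worst case: up to the root and back down again
    print(n, 2 * tree_levels(n, 8))   # hop count grows with machine size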


From: Robert Myers on
On Mar 11, 10:01 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
> Robert Myers wrote:
> > On Mar 10, 9:33 pm, "Del Cecchi" <delcec...(a)gmail.com> wrote:
> >>  As I said a few days ago, bandwidth costs money.
> >> Latency is with us always.
>
> > You keep saying that, and it makes me grind my teeth every time you
> > do.  When Seymour said, "You can't fake it," he was talking about
> > *bandwidth*, not latency.  You *can* fake latency, and IBM did some of
> > the most fundamental work in that area.  You *cannot* fake bandwidth.
>
> We fake latency via caches and prefetchers.  Which work pretty well, but for some workloads don't work at all.
>
> We fake bandwidth by replicating computation and by compressing data.  Instead of doing the smallest number of computations, you
> compress the data, send the smallest amount of data between nodes, and then use local computation to uncompress
> and/or replicate computations.
>
> Faking bandwidth doesn't seem to work quite as well.
>
In many instances, it doesn't matter if the result arrives even seconds later, so long as you don't lengthen the
critical path significantly, because those seconds are small compared to the total compute time for problems put on
really big computers.

If you degraded the bandwidth by the same ratio (a hundred nanoseconds stretched to a few seconds), you might as well
skip the electronics and use humans with paper and pencil.

Sure "you can't fake bandwidth" can be quibbled with, but the extent
to which faking is potentially useful is very small, as compared to
latency, where, if the access pattern is predictable, the latency to
get the calculation started hardly matters at all. There will always
be problems with unpredictable access patterns. If you really need to
be doing petaflops with calculations like that, either the problem had
best be embarrassingly parallel or you may as well give up.

> ---
>
> I'm thinking about this because it became obvious in my conversation with Ivan Sutherland that his head is totally in
> bandwidth space.  As you might expect from graphics.
>
> Whereas I have spent most of my career in latency space.
>
> This is causing me to wonder: are there any important computations that are still latency sensitive?  Or is everything
> bandwidth sensitive from now on?
>
Some operations research calculations are inherently serial and
therefore latency sensitive. My argument has been that if such
calculations were all *that* important, you'd see a big market for
computers with heroic cooling.

If anyone, even on Wall Street, is doing it to gain a few
milliseconds, I've not heard of it.

> I suspect latency still matters, but I want to understand how, in what workloads.  Particularly, in what parallel
> workloads does latency still matter.
>
> By the way, though, the really latency sensitive supercomputer workloads tend not to be so incompatible
> architecture-wise with bandwidth, because they tend not to benefit much from caches or from complex multistage routing
> networks that emphasize local connectivity.  They tend to want low latency access to global data.
> It's not latency vs. bandwidth.  It's locality vs. globality, and within the global part, maybe global latency vs. global bandwidth.
>
> And here may be the crux of the problem: global latency and global bandwidth are not incompatible when the interconnect
> is shallow.  But when you start adding multiple stages to the interconnect, you tend to hurt global latency; but you
> also provide a place that is just so very damned attractive to add a processor.  Every switching node is a place you can
> add a processor.  And when you add processors in the switching nodes, you create architectures that favor locality.

Which is basically where we are right now, and maybe the logic is so
compelling that that's the end of the story.

Robert.
From: Andrew Reilly on
On Thu, 11 Mar 2010 07:38:44 -0600, Del Cecchi` wrote:

> Apparently FFT doesn't let you fake bandwidth or latency.

It depends on how they're written, of course, but FFTs don't necessarily
care about latency at all: the access/communications pattern might be
total and intricate, but it is entirely deterministic. Back in the '80s
my boss made an FFT engine that the CSIRO (and later SETI) used for radio
astronomy. It used DRAM for all storage, but the compute unit was 100%
saturated, because the computation program and the memory access program
were effectively pre-computed (unrolled) and scheduled around the DRAM
latency and then stored in a ROM (and later that ROM was optimized/compressed
into a state machine).

I dare say that the FFT routines that run on the big, distributed supers
operate in much the same way, or at least they could.
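
For what it's worth, here is a rough sketch of the idea Andrew describes - not the actual CSIRO engine, whose internals
I don't know, just an illustration that a radix-2 FFT's butterfly/address sequence is fully deterministic, so it can be
generated once, "unrolled", and replayed from a table (the ROM), leaving the compute unit nothing to wait for.

# Generate the complete butterfly schedule for an n-point radix-2 DIT FFT
# (input assumed already in bit-reversed order). Every (stage, a, b, twiddle)
# tuple is known before the transform runs, so it could live in a ROM.
def fft_schedule(n):
    sched = []
    stages = n.bit_length() - 1
    for s in range(stages):
        half = 1 << s
        span = half << 1
        for start in range(0, n, span):
            for k in range(half):
                a = start + k
                b = a + half
                tw = k * (n // span)   # twiddle-factor index
                sched.append((s, a, b, tw))
    return sched

ROM = fft_schedule(8)
print(len(ROM))     # (n/2)*log2(n) = 12 butterflies
print(ROM[:4])      # first stage: adjacent pairs, twiddle index 0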

Cheers,

--
Andrew
From: Paul A. Clayton on
On Mar 11, 9:42 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
[snip]
>         Adding another completely separate processor, with the associated interconnect overhead
> vs      
>         Adding more threading.
[snip]
> (2H+3L+G) / X-multicore
>
> vs
>
> (H+M+L+G) / X-multithread
>
> But this means that the incremental value, the utilization of the extra flops that you are adding via multithreading,
> may be significantly less than the incremental value in multicore.  The old economist's "Occasionally it is worth selling a
> product below cost, so long as it covers your marginal cost" issue.
>
> It also means that there is a sawtooth wave.   If you want to increase performance by 1.5X, it may be worth adding
> threads;  but when you cross the threshold, you need to add more cores and back off on the threads. I.e. when you add
> cores you want to back off a few generations on the threading.

Just a few quick (obvious) comments:

* Multithreading can increase locality of communication (potentially even more so than multiple cores sharing an L1 DCache).

* Multithreading encourages 'fat' cores that can exploit temporally local parallelism (in some sense a repeat of the previous point--locality of communication).

* Multicore provides greater thermal separation, potentially providing thermal headroom for higher frequency.

* Multicore fits better with 'Bubblewrap/Processor popping'.

* Multicore tends to reduce scheduling complexity.

* Multithreading allows finer-grained resource allocation at various levels of temporal granularity (a thread could, e.g., use less than the maximum number of registers, use fewer execution resources, et al.) without heterogeneity of cores and interconnect.

Obviously the divisions are not clear cut. Caches, instruction decode hardware, OoO support hardware, execution
resources, result routing, registers, etc. can be shared (or not); hierarchies can have different sharing/latency.
(I have not read any papers suggesting the potential dual use of storage space for 'dead but not committed' register
values or SoEMT waiting thread context. Nor have I read any suggestion that something like Larrabee's SIMD registers
could be used to hold waiting thread contexts for server-style, thread-rich applications with limited data-level
parallelism.)


Paul A. Clayton
still just a technophile