From: "Andy "Krazy" Glew" on
Michael S wrote:

> If they are doing all that, I simply can't see how one of the existing GPUs
> (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> So, either they compare an array of GPUs with a single host CPU or their
> host code is very far from optimal. I'd bet on the latter.

Let's see: http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg
says that Nhm can do 2 128-bit SSE adds and 1 128-bit SSE mul per cycle.
Now, you might count that as 12 SP FLOPs or 6 DP FLOPs per cycle.
Multiplied by 8 cores on a chip, you might get 96 SP FLOPs.

However, most supercomputer people count a flop as a
multiply-accumulate. By that standard, Nhm is only 4 SP mul-add FLOPs
per cycle. Add a fudge factor for the extra adder, but certainly not
2X, probably not even 1.5X -- and purists won't even give you that. 32
FLOPS. If you are lucky.

Seldom do you get the 100% utilization of the FMUL unit that you would
need to get 32 SP FLOPs. Especially not when you throw in MP bus
contention, thread contention, etc.

Whereas the GPUs tend to have real FMAs. Intel and AMD have both
indicated that they are going in the FMA direction. But I don't think
that has shipped yet.

And, frankly, it is easier to tune your code to get good utilization on
a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
for simple codes, yes, CUDA is easier. In my experience. And I'm a
fairly good x86 programmer, and a novice CUDA GPU programmer. I look
forward to Terje reporting his experience tuning code for CUDA (as long
as he isn't tuning wc).

The painful thing about CUDA is the ugly memory model - blocks, blah,
blah, blah. And it is really bad when you have to transfer stuff from
CPU to GPU memory. I hope and expect that Fermi will ameliorate this pain.

---

People are reporting 50x and 100x improvements on CUDA all over the
place. Try it yourself. Be sure to google for tuning advice.
From: "Andy "Krazy" Glew" on
Andrew Reilly wrote:
> On Wed, 09 Dec 2009 21:25:32 -0800, Andy "Krazy" Glew wrote:
>
>> Like I said, I was surprised at how many supercomputer customers
>> expressed this x86 orientation. I expected them to care little about
>> x86.
>
> I still expect those who use Cray or NEC vector supers, or any of the
> scale-up SGI boxes, or any of the Blue-foo systems to care very little
> indeed. The folk who seem to be getting mileage from the CUDA systems
> probably only care peripherally.

Actually some of the CUDA people do care.

They'll use CUDA for the performance critical code, and x86 for all the
rest, in the system it is attached to. With the x86 tools.

Or at least that's what they told me at SC09.
From: "Andy "Krazy" Glew" on
jgd(a)cix.compulink.co.uk wrote:
> In article <1isTm.93647$Pi.24332(a)newsfe30.ams2>, meru(a)devnull.com
> (ChrisQ) wrote:
>
>> The obvious question then is: Would one of many x86 cores be fast
>> enough on its own to run legacy Windows code like Office, Photoshop,
>> etc.?...
>
> Maybe. But can marketing men convince themselves that this would be the
> case? Almost certainly: a few studies about how many apps the average
> corporate Windows user has open at a time could work wonders. The
> problem, of course, is that most of those apps aren't consuming much CPU
> except when they have input focus. But that's the kind of thing that
> marketing departments are good at neglecting.

At SC09 the watchword was heterogeneity.

E.g. a big OOO x86 core, with small efficient cores of your favorite
flavour. On the same chip.

While you could put a bunch of small x86 cores on the side, I think that
you would probably be better off putting a bunch of small non-x86 cores
on the side. Like GPU cores. Like Nvidia. Or AMD/ATI Fusion.

Although this makes sense to me, I wonder if the people who want x86
really want x86 everywhere - on both the big cores, and the small.

Nobody likes the hetero programming model. But if you get a 100x perf
benefit from GPGPU...
From: Terje Mathisen on
Michael S wrote:
> If they are doing all that, I simply can't see how one of the existing GPUs
> (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> So, either they compare an array of GPUs with a single host CPU or their
> host code is very far from optimal. I'd bet on the latter.

It seems to be a bandwidth problem much more than a FLOPs problem, i.e.
using the texture units effectively was the key to the big wins.

Take a look yourself:

http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Seismic_Hess.pdf

Page 15 has the optimization graph, stating that 'Global Memory
Coalescing', 'Optimized Use of Shared Memory' and getting rid of
branches were the main contributors to the speedups.

It is of course possible that the CPU baseline was quite naive code, or
that the CPUs used were quite old, but I would hope not.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Michael S on
On Dec 11, 7:23 am, "Andy "Krazy" Glew" <ag-n...(a)patten-glew.net>
wrote:
> Michael S wrote:
> > If they are doing all that, I simply can't see how one of the existing GPUs
> > (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> > Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> > significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> > So, either they compare an array of GPUs with a single host CPU or their
> > host code is very far from optimal. I'd bet on the latter.
>
> Let's see: http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg
> says that Nhm can do 2 128-bit SSE adds and 1 128-bit SSE mul per cycle.
> Now, you might count that as 12 SP FLOPs or 6 DP FLOPs per cycle.
> Multiplied by 8 cores on a chip, you might get 96 SP FLOPs.
>

I counted it as 8 SP FLOPs. If Wikipedia claims that Nehalem can do 2
128-bit FP adds per cycle then they are wrong, but more likely you
misread it. Nehalem has only one 128-bit FP adder, attached to port 1,
exactly like the previous members of the Core 2 family. Port 5 is only
"move and logic", not capable of FP arithmetic.
8 FLOPs/core * 4 cores/chip * 2.93 GHz => ~94 GFLOPs

> However, most supercomputer people count a flop as a
> multiply-accumulate.
> By that standard, Nhm is only 4 SP mul-add FLOPs
> per cycle.

Bullshit. Supercomputer people count exactly like everybody else. Look
at "peak flops" in LINPACK reports.

> Add a fudge factor for the extra adder, but certainly not
> 2X, probably not even 1.5X -- and purists won't even give you that. 32
> FLOPS. If you are lucky.
>
> Seldom do you get the 100% utilization of the FMUL unit that you would
> need to get 32 SP FLOPs. Especially not when you throw in MP bus
> contention, thread contention, etc.
>
> Whereas the GPUs tend to have real FMAs.

That has nothing to do with the calculation at hand. When AMD says that
their new chip does 2.72 TFLOPs they really mean 1.36 TFMAs.

> Intel and AMD have both indicated that they are going in the FMA
> direction. But I don't think that has shipped yet.

Hopefully, Intel is not going in the FMA direction. 3 source operands
are a major PITA for a P6-derived uarch. Most likely it requires
coordinated dispatch via two execution ports, so it would give nothing
for peak throughput. But you surely know more than me about it.
FMA makes sense on Silverthorne, but I'd rather see Silverthorne dead.

>
> And, frankly, it is easier to tune your code to get good utilization on
> a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
> for simple codes, yes, CUDA is easier. In my experience. And I'm a
> fairly good x86 programmer, and a novice CUDA GPU programmer. I look
> forward to Terje reporting his experience tuning code for CUDA (as long
> as he isn't tuning wc).

I'd guess you played with microbenchmarks. I can't imagine it to be
true on real-world code that, yes, is "ugly" - but what can we do,
real-world problems are almost never nice and symmetric.


>
> The painful thing about CUDA is the ugly memory model - blocks, blah,
> blah, blah. And it is really bad when you have to transfer stuff from
> CPU to GPU memory. I hope and expect that Fermi will ameliorate this pain.
>
> ---
>
> People are reporting 50x and 100x improvements on CUDA all over the
> place. Try it yourself. Be sure to google for tuning advice.

First, 95% of the people can't do proper SIMD+multicore on the host CPU
to save their lives, and that's already a large proportion of the
"people reporting". Of those who are honest and know what they are
doing, the majority likely did not have a computationally bound problem
to start with, and they found a way to take advantage of the texture
units.
According to Terje (see below) that was the case in the seismic code he
brought up as an example.

Still, I have a feeling that a majority (not all) of the PDE-type
problems that can be assisted by texture units on a GPU could, on the
host CPU, be reformulated to exploit temporal locality via the on-chip
cache. But that's just a feeling, nothing scientific.