From: Michael S on
On Dec 10, 10:32 am, Terje Mathisen <Terje.Mathi...(a)tmsw.no> wrote:
> Robert Myers wrote:
> > Nvidia stock has drooped a bit after the *big* bounce it took on the
> > Larrabee announcement, but I'm not sure why everyone is so negative on
> > Nvidia (especially Andy). They don't appear to be in much more
> > parlous a position than anyone else. If Fermi is a real product, even
> > if only at a ruinous price, there will be buyers.
>
> I have seen a report by a seismic processing software firm, indicating
> that their first experiments with GPGPU programming had gone very well:
>
> After 8 rounds of optimization, which basically consisted of mapping
> their problem (acoustic wave propagation, according to Kirchhoff) onto
> the actual capabilities of a GPU card, they went from being a little
> slower than the host CPU up to nearly two orders of magnitude faster.
>
> This meant that Amdahl's law started rearing its ugly head: The setup
> overhead took longer than the actual processing, so now they are working
> on moving at least some of that surrounding code onto the GPU as well.
>
> Anyway, with something like 40-100x speedups, oil companies will be
> willing to spend at least $1000+ per chip.
>
> However, I'm guessing that the global oil processing market has no more
> than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
> would scrap their current setup.
>
> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
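
Terje's Amdahl's-law point can be put in numbers with a minimal sketch; the fractions and speedups below are illustrative assumptions, not figures from the report:

```python
# Amdahl's law: overall speedup when a fraction p of the runtime is
# accelerated by a factor s; the rest (the setup overhead) is unchanged.
def overall_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Assumed: the wave-propagation kernel was 99% of the original runtime
# and now runs 100x faster on the GPU.
print(overall_speedup(0.99, 100))   # ~50x: the setup now dominates

# Moving the surrounding code onto the GPU raises the accelerated fraction:
print(overall_speedup(0.999, 100))  # ~91x
```

With p = 0.99 the untouched 1% caps the gain at 100x no matter how fast the kernel gets, which is exactly the "setup took longer than the actual processing" effect described above.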


"8 rounds of optimization", that's impressive.
I wonder how much speed-up they could get from the host CPU after just
3 rounds:
1. double->single, to reduce memory footprint
2. SIMD
3. Exploit all available cores/threads
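
For a rough sense of scale, the three rounds above compose multiplicatively at best; the per-round factors here are back-of-the-envelope assumptions for a c.2009 quad-core Nehalem, not measurements:

```python
# Idealized upper-bound factors for each host-CPU optimization round:
double_to_single = 2   # halved memory footprint/bandwidth: at best ~2x
simd = 4               # 4-wide single-precision SSE: at best ~4x
cores = 4              # four cores, ignoring SMT: at best ~4x

# Perfect composition (rarely achieved on memory-bound code):
upper_bound = double_to_single * simd * cores
print(upper_bound)  # 32
```

Even the idealized 32x sits below the 40-100x quoted for the GPU, but it narrows the gap considerably against a naive scalar double-precision baseline.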
From: nmm1 on
In article <4B20864C.90607(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>A surprisingly large number of supercomputer customers use libraries and
>tools that have some specific x86 knowledge.

It used to be very few, but the number is increasing.

>Like I said, I was surprised at how many supercomputer customers
>expressed this x86 orientation. I expected them to care little about x86.

A lot of it is due to the change in 'community' and their applications.
It's not just the libraries, but I have little to add to what you
said about those (I could give other examples, but so what?).

Traditionally, people were really up against hard limits, and they
were prepared to both spend serious effort in tuning and switch to
whichever system offered them the most time. There are still a lot like
that. Fortran and MPI dominate, and few people give a damn about the
architecture.

An increasing number want to use a 'supercomputer' as an alternative
to tuning their code. Some of those codes are good, some are merely
inefficient, some are unnecessarily x86-dependent, and some LOOK
x86-dependent because they are just plain broken. C++ and shared
memory dominate.

And, as usual, nothing is hard and fast, so there are intermediates
and mixtures and ....


Regards,
Nick Maclaren.
From: Terje Mathisen on
Michael S wrote:
> On Dec 10, 10:32 am, Terje Mathisen<Terje.Mathi...(a)tmsw.no> wrote:
>> I have seen a report by a seismic processing software firm, indicating
>> that their first experiments with GPGPU programming had gone very well:
>>
>> After 8 rounds of optimization, which basically consisted of mapping
>> their problem (acoustic wave propagation, according to Kirchhoff) onto
>> the actual capabilities of a GPU card, they went from being a little
>> slower than the host CPU up to nearly two orders of magnitude faster.
>>
>> This meant that Amdahl's law started rearing its ugly head: The setup
>> overhead took longer than the actual processing, so now they are working
>> on moving at least some of that surrounding code onto the GPU as well.
>>
>> Anyway, with something like 40-100x speedups, oil companies will be
>> willing to spend at least $1000+ per chip.
>>
>> However, I'm guessing that the global oil processing market has no more
>> than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
>> would scrap their current setup.
>>
>> Terje
> "8 rounds of optimization", that's impressive.
> I wonder how much speed-up they could get from the host CPU after just
> 3 rounds:
> 1. double->single, to reduce memory footprint
> 2. SIMD
> 3. Exploit all available cores/threads

I'm pretty sure they are already doing all of those, at least in the lab
where they tested GPGPU.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Thomas Womack on
In article <4B207537.5030205(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>Also, AMD/ATI definitely overtook Nvidia. I think that Nvidia
>emphasized elegance, and GP GPU futures stuff, whereas ATI went the
>slightly inelegant way of combining SIMT Coherent Threading with VLIW.
>It sounds more elegant when you phrase it my way, "combining SIMT
>Coherent Threading with VLIW", than when you have to describe it without
>my terminology. Anyway, ATI definitely had a performance per transistor
>advantage.

ATI win on performance, but nVidia win by miles on GPGPU software
development, simply because they've picked a language and stuck with
it, and at some point some high-up insisted that the GPGPU compilers
be roughly synchronised with the hardware releases; I expect to be
able to pick up a Fermi card, download the latest nvidia SDK, build
something linked with cufft, and get a reasonable performance.

ATI's compiler and driver stack, to the best of my knowledge, doesn't
support double precision yet, well after the second generation of
chips with DP support has appeared.

An AMD employee posted in their OpenCL forum about four weeks ago:

"Double precision floating point support is important for us. We are
planning to begin to introduce double precision arithmetic support in
first half of 2010 as well as the start of some built-ins over time."

Tom
From: Michael S on
On Dec 10, 11:48 am, Terje Mathisen <Terje.Mathi...(a)tmsw.no> wrote:
> Michael S wrote:
> > On Dec 10, 10:32 am, Terje Mathisen<Terje.Mathi...(a)tmsw.no> wrote:
> >> I have seen a report by a seismic processing software firm, indicating
> >> that their first experiments with GPGPU programming had gone very well:
>
> >> After 8 rounds of optimization, which basically consisted of mapping
> >> their problem (acoustic wave propagation, according to Kirchhoff) onto
> >> the actual capabilities of a GPU card, they went from being a little
> >> slower than the host CPU up to nearly two orders of magnitude faster.
>
> >> This meant that Amdahl's law started rearing its ugly head: The setup
> >> overhead took longer than the actual processing, so now they are working
> >> on moving at least some of that surrounding code onto the GPU as well.
>
> >> Anyway, with something like 40-100x speedups, oil companies will be
> >> willing to spend at least $1000+ per chip.
>
> >> However, I'm guessing that the global oil processing market has no more
> >> than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
> >> would scrap their current setup.
>
> >> Terje
> > "8 rounds of optimization", that's impressive.
> > I wonder how much speed-up they could get from the host CPU after just
> > 3 rounds:
> > 1. double->single, to reduce memory footprint
> > 2. SIMD
> > 3. Exploit all available cores/threads
>
> I'm pretty sure they are already doing all of those, at least in the lab
> where they tested GPGPU.
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

If they are doing all that, I simply can't see how one of the existing
GPUs (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor
of >10. Nehalem is rated at ~100 SP GFLOPS. Are there GPU chips that
are significantly above 1 SP TFLOPS? According to Wikipedia there are
not. So either they compare an array of GPUs with a single host CPU,
or their host code is very far from optimal. I'd bet on the latter.
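
Michael's peak-rate arithmetic can be checked directly; the chip figures below are the commonly published peak numbers, taken here as assumptions:

```python
# Peak SP GFLOPS for a 3 GHz quad-core Nehalem:
# 4 cores x 3.0 GHz x 4-wide SSE x 2 flops/cycle (mul + add)
nehalem_sp_gflops = 4 * 3.0 * 4 * 2   # 96 GFLOPS
# A contemporary high-end GPU, e.g. a GTX 285, peaks at roughly 1 SP TFLOPS:
gpu_sp_gflops = 1062.0
print(gpu_sp_gflops / nehalem_sp_gflops)  # ~11x peak-vs-peak
```

A peak-vs-peak ratio of only ~11x means a measured gap well above 10x over a genuinely well-tuned host implies either multiple GPUs per CPU or an untuned baseline, as argued above.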