From: Thomas Womack
In article <34ea667e-779a-44d8-ab63-c032df1cb067(a)q35g2000yqn.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>At a time when vector processors were still a fading memory (even in
>the US), an occasional article would mention that "vector computers"
>were easier to use for many scientists than thousands of COTS
>processors hooked together by whatever.

Yes, this is certainly true. Earth Simulator demonstrated that you
could build a pretty impressive vector processor, which (Journal of
the Earth Simulator - one of the really good resources since it talks
about both the science and the implementation issues) managed 90%
performance on lots of tasks, partly because using it was very
prestigious and you weren't allowed to use the whole machine on jobs
which didn't manage very high performance on a 10% subset. But it was
a $400 million project to build a 35 Tflops machine, and the subsequent
project to spend a similar amount this decade on a heftier machine
came to nothing.

I've worked at an establishment with an X1, and it was woefully
under-used because the problems that came up didn't fit the vector
organisation terribly well; it is not at all clear why they bought the
X1 in the first place.

>The real problem is not in how the computation is organized, but in
>how memory is accessed. Replicating the memory access style of the
>early Cray architectures isn't possible beyond a very limited memory
>size, but it sure would be nice to figure out a way to simulate the
>experience.

I _think_ this starts, particularly for the crystalline memory access
case, to be almost a language-design issue.
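
(Here I take "crystalline" to mean an access pattern where every address
is a simple, compile-time-known function of the loop indices, as in a
stencil sweep. A toy sketch in C, names made up:)

  /* 2-D Jacobi-style stencil over n x n row-major grids: every address
     is an affine function of (i,j), so a language/compiler that knows
     this can plan the whole memory-traffic pattern up front */
  void stencil(int n, const float *a, float *b) {
      for (int i = 1; i < n - 1; i++)
          for (int j = 1; j < n - 1; j++)
              b[i*n + j] = 0.25f * (a[(i-1)*n + j] + a[(i+1)*n + j]
                                  + a[i*n + (j-1)] + a[i*n + (j+1)]);
  }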

Tom

From: Robert Myers
On Jul 23, 1:19 pm, Thomas Womack <twom...(a)chiark.greenend.org.uk>
wrote:
> In article <34ea667e-779a-44d8-ab63-c032df1cb...(a)q35g2000yqn.googlegroups..com>,
> Robert Myers  <rbmyers...(a)gmail.com> wrote:
>
> >At a time when vector processors were still a fading memory (even in
> >the US), an occasional article would mention that "vector computers"
> >were easier to use for many scientists than thousands of COTS
> >processors hooked together by whatever.
>
> Yes, this is certainly true.  Earth Simulator demonstrated that you
> could build a pretty impressive vector processor, which (Journal of
> the Earth Simulator - one of the really good resources since it talks
> about both the science and the implementation issues) managed 90%
> performance on lots of tasks, partly because using it was very
> prestigious and you weren't allowed to use the whole machine on jobs
> which didn't manage very high performance on a 10% subset.  But it was
> a $400 million project to build a 35 Tflops machine, and the subsequent
> project to spend a similar amount this decade on a heftier machine
> came to nothing.
>
> I've worked at an establishment with an X1, and it was woefully
> under-used because the problems that came up didn't fit the vector
> organisation terribly well; it is not at all clear why they bought the
> X1 in the first place.
>
So: if you can cheaply build a machine with lots of flops that you
sometimes can't use, who cares, as long as the flops you *can* use are
still more plentiful and less expensive than those of, say, an Earth
Simulator style effort - especially if there are lots of problems for
which the magnificently awesome vector processor is useless? That's
essentially the argument used to defend the purchasing decisions being
made at a national level in the US.

I would agree, if only I could wrest a tiny concession from the
empire-builders. The machines they are building are *not* scalable,
and I wish they'd stop claiming they are. It would be like my cable
company claiming that its system is scalable because it can hang as
many users off the same cable as it can get away with. It's all very
well until too many try to use the bandwidth at once.

Having the bandwidth per flop drop to zero is no different from having
the bandwidth per user drop to zero, and even my cable company, which
has lots of gall, wouldn't have the gall to claim that it's not a
problem and that they don't have to worry about it, because they do.
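
To put rough numbers on it (round figures picked purely for
illustration, and the kernel name is made up):

  /* "triad"-style kernel: 2 flops per element, 24 bytes of memory
     traffic per element (load a[i], load b[i], store c[i], 8 bytes
     each in double precision), i.e. it wants about 12 bytes/flop */
  void triad(int n, double s, const double *a, const double *b, double *c) {
      for (int i = 0; i < n; i++)
          c[i] = a[i] + s * b[i];
  }
  /* a node with, say, 1 Tflop/s of peak and 100 GB/s of memory
     bandwidth offers 0.1 bytes/flop, so this loop can use less than 1%
     of peak - and the ratio only gets worse as flops are added faster
     than bandwidth */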

> >The real problem is not in how the computation is organized, but in
> >how memory is accessed.  Replicating the memory access style of the
> >early Cray architectures isn't possible beyond a very limited memory
> >size, but it sure would be nice to figure out a way to simulate the
> >experience.
>
> I _think_ this starts, particularly for the crystalline memory access
> case, to be almost a language-design issue.
>

Engineers apparently find Matlab easy to use. No slight to Matlab,
but the disconnect with the hardware can be painful. I don't think
the hardware and software issues can be separated.

Robert.
From: Andy Glew
On 7/21/2010 3:18 PM, George Neuner wrote:
> On Tue, 20 Jul 2010 15:41:13 +0100 (BST), nmm1(a)cam.ac.uk wrote:
>
>> In article<04cb46947eo6mur14842fqj45pvrqp61l1(a)4ax.com>,
>> George Neuner<gneuner2(a)comcast.net> wrote:
>>>
>>> ISTM bandwidth was the whole point behind pipelined vector processors
>>> in the older supercomputers. ...
>>> ... the staging data movement provided a lot of opportunity to
>>> overlap with real computation.
>>>
>>> YMMV, but I think pipeline vector units need to make a comeback.
>>
>> NO chance! It's completely infeasible - they were dropped because
>> the vendors couldn't make them for affordable amounts of money any
>> longer.

>
> Actually I'm a bit skeptical of the cost argument ... obviously it's
> not feasible to make large banks of vector registers fast enough for
> multiple GHz FPUs to fight over, but what about a vector FPU with a
> few dedicated registers?


I have been reading this thread somewhat bemused.

To start, full disclosure: I have proposed that pipelined vector
instructions make a comeback, in my postings to this group and in my
presentations, e.g. at Berkeley Parlab (linked to on some web page).
Reason: not to improve performance, but to reduce costs compared to what
is done now.

What is done now?

There are CPUs with FPUs pipelined circa 5-7 cycles deep. Commonly 2
sets of 4 32-bit SP elements wide, sometimes 8 or 16 wide.
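
(For concreteness, a rough sketch of what one of those 4-wide SP
operations looks like to software, using the x86 SSE intrinsics as just
one example; the function name is made up:)

  #include <xmmintrin.h>

  void add4(const float *a, const float *b, float *c) {
      __m128 va = _mm_loadu_ps(a);           /* 4 single-precision elements */
      __m128 vb = _mm_loadu_ps(b);
      _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* one instruction, 4 adds,
                                                pipelined a few cycles deep */
  }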

There are GPUs with 256-1024 SP FPUs on them. I'm not so sure about
pipeline depth, but the recommendation that dependent ops be no closer
together than 40 cycles suggests that it is deep.

The GPUs often have 16KB of registers for each group of 32 or so FPUs.

I.e. we are building systems with more FPUs, more deeply pipelined FPUs,
and more registers than the vector machines I am most familiar with,
Cray-1 era machines. I don't know by heart the specs for the last few
generations of vector machines before they died back, but I suspect that
modern CPUs and, especially, GPUs, are comparable.

Except
(1) they are not organized as vector machines, and
(2) the memory subsystems are less powerful, in proportion to the FPUs,
than in the old days.

I'm going to skate past the memory subsystems, since we have talked about
this at length elsewhere, and since that will be the topic of Robert
Myers' new mailing list. Except to say (a) high-end GPUs often have
memory separate from the main CPU memory, made with more expensive
graphics DRAMs (GDDR) rather than conventional DRAMs, and (b) modern
DRAMs emphasize sequential burst accesses in ways that the Cray-1's
SRAM-based memory subsystem did not. Basically, commodity DRAM does not
lend itself to non-unit-stride access patterns. And building a big system
out of non-commodity memory is far more cost-prohibitive than it was back
in the day of the Cray-1. This was already becoming evident in the last
years of the old vector processors.
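
(A minimal illustration of what unit-stride versus non-unit-stride means
to software - same data, same arithmetic, very different DRAM behaviour;
function names made up:)

  /* n x n row-major matrix */
  float sum_row(int n, const float *m, int r) {  /* unit stride: consecutive */
      float s = 0.0f;                            /* addresses, every byte of */
      for (int j = 0; j < n; j++)                /* each DRAM burst is used  */
          s += m[r*n + j];
      return s;
  }

  float sum_col(int n, const float *m, int c) {  /* stride n: each access    */
      float s = 0.0f;                            /* lands in a different     */
      for (int i = 0; i < n; i++)                /* burst, so most of the    */
          s += m[i*n + c];                       /* transferred bytes are    */
      return s;                                  /* thrown away              */
  }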

But let's get back to the fact that these modern machines, with more
FPUs, more deeply pipelined FPUs, and more registers than the classic
vector machines, are not organized as pipelined vector machines. To
some limited extent they are small parallel vector machines - operating
on 4 32-bit SP elements in a given operation, in parallel, in one
instruction. The actual FPU operation is pipelined. There may be a small
degree of vector pipelining, e.g. spreading an 8-element vector over 2
cycles. But not the same degree of vector pipelining as in the old days,
where a single instruction might be pipelined over 16 or 64 cycles.
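
(To make the contrast concrete - a sketch, again using SSE as the
stand-in for a modern short-vector unit: covering 64 elements takes 16
separate 4-wide instructions per operation, where a Cray-style machine
would issue one instruction with vector length 64 and stream it through
the pipeline:)

  #include <xmmintrin.h>

  /* y[0..63] *= x[0..63], the modern SIMD way: 16 separate 4-wide
     multiplies, each decoded and scheduled individually */
  void mul64(const float *x, float *y) {
      for (int i = 0; i < 64; i += 4) {
          __m128 vy = _mm_loadu_ps(y + i);
          __m128 vx = _mm_loadu_ps(x + i);
          _mm_storeu_ps(y + i, _mm_mul_ps(vy, vx));
      }
  }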

Why aren't modern CPUs and GPUs vector pipelined? I think one of the
not widely recognized things is that we are significantly better at
pipelining now than in the old days. The Cray-1 had 8 gate delays per
cycle. I suspect that one of the motivations for vectors was that it
was a challenge to decode back-to-back dependent instructions at that
rate, whereas it was easier to decode an instruction, set up a vector,
and then run that vector instruction for 16 to 64 cycles. Yes, chaining
still had to be arranged, and yes, I know that one of the Cray-1's
claims to fame was its better scalar instruction performance.

If you can run individual scalar instructions as fast as you can run
vector instructions, giving the same FLOPS, wouldn't you rather? Why
use vectors rather than scalars?
I'll answer my own question: (a) vectors allow you to use the same
number of register bits to specify a lot more registers -
#vector-registers * #elements per vector. (b) vectors save power - you
only decode the instruction once, and the decoding and scheduling logic
gets amortized over the entire vector.
But if you aren't increasing the register set or adding new types
of registers, and if you aren't that worried about power, then you don't
need vectors.
But we are worried about power, aren't we?

Why aren't modern GPUs vector pipelined? Basically because they are
SIMD, or, rather, SIMD in its modern evolution of SIMT, CIMT, Coherent
Threading. This nearly always gets 4 cycles' worth of amortization of
instruction decode and schedule cost. And it seems to be easier to
program. And it promotes portability.

When I started working on GPUs, I thought, like many on this newsgroup,
that vector ISAs were easier to program than SIMD GPUs. I was quite
surprised to find out that this is NOT the case. Graphics programmers
consistently prefer the SIMD programming model. Or, rather, they
consistently prefer to have lots of little threads executing scalar or
moderate VLIW or short vector instructions, rather than fewer,
heavyweight threads executing longer vector instructions. Partly
because their problems tend to be short-vector, 4-element, rather than
long-vector operations. Perhaps because SIMD is what they are familiar
with - although, again, I emphasize that SIMT/CIMT is not the same as
classic Illiac-IV SIMD. I think that one of the most important aspects
is that SIMD/SIMT/CIMT code is more portable - it runs fairly well on
both GPUs and CPUs. And it runs on GPUs no matter whether the parallel
FPUs, what would be the vector FPUs, are 16 wide x 4 cycles, or 8 wide x
8 cycles, or ...
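
(For a feel of what that model looks like to the programmer, here is a
rough CUDA-style sketch - an illustration, not any particular vendor's
production code. Each "thread" is scalar, and how wide and how deep the
FPUs underneath actually are is invisible to it:)

  /* one scalar "thread" per element; the hardware groups threads into
     warps/wavefronts and maps them onto whatever FPU arrangement it has */
  __global__ void saxpy(int n, float a, const float *x, float *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a * x[i] + y[i];
  }

  /* launched as, e.g., saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y); */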



From: Andy Glew
The workday is officially over at 5pm, so I can continue the post I
started at lunch. (Although I am pretty sure I will get back to work
this evening.)


Continuing the discussion of the advantages of vector instruction sets
and hardware.

Vector ISAs allow you to have a whole lot of registers accessible from
relatively small register numbers in the instruction. GPU
SIMD/SIMT/CIMT gets the same effect by having a whole lot of threads,
each given a variable number of registers. Basically, reducing the
number of registers allocated to threads (which run in warps or
wavefronts, say 16 wide over 4 cycles) is equivalent to, and probably
better than, having a variable vector length - variable on a
per-vector-register basis. I'm not aware of many classic vector ISAs
doing this - and if they did, they would lose the next advantage.

Vector register files can be cheaper than ordinary register files.
Instead of requiring that any register be accessible at full speed,
vector ISAs only require that the first element of a vector be indexed
fast; subsequent elements can stream along with greater latency.
However, I'm not aware of any recent vector hardware uarch that has
taken advantage of this possibility. Usually they just build a great big
wide register file.

Vector ISAs are on a slippery slope of ISA complexity. First you have
vector+vector -> vector ops. Then you add vector sum reductions. Inner
products. Prefix calculations. Operate under mask. Etc. This slippery
slope seems much less slippery for CIMT, since most of these
operations can be synthesized simply out of the scalar operations that
are their basis.
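
(For instance, a vector sum reduction - something a vector ISA needs
dedicated support for - falls out of ordinary scalar adds plus the
existing thread machinery in the SIMT world. A rough CUDA sketch of the
standard shared-memory tree reduction, assuming 256-thread blocks and a
made-up kernel name:)

  __global__ void block_sum(const float *in, float *out, int n) {
      __shared__ float buf[256];        /* one slot per thread in the block */
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
      __syncthreads();
      for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
          if (threadIdx.x < stride)     /* plain scalar adds, pairwise */
              buf[threadIdx.x] += buf[threadIdx.x + stride];
          __syncthreads();
      }
      if (threadIdx.x == 0)
          out[blockIdx.x] = buf[0];     /* one partial sum per block */
  }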

Vector chaining is a source of performance - and complexity. It happens
somewhat for free with Nvidia style scalar SIMT, and the equivalent of
more complicated chaining complexes can be set up using ATI/AMD's VLIW SIMT.

All this being said, why would I be interested in reviving vector ISAs?

Mainly because vector ISAs allow the cost of instruction decode and
scheduling to be amortized.

But also because, as I discussed in my Berkeley Parlab presentation of
Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
somewhat the deficiencies of coherent threading, specifically the
problem of divergence.



From: Terje Mathisen
Andy Glew wrote:
> Vector register files can be cheaper than ordinary register files.
> Instead of requiring that any register be accessible at full speed,
> vector ISAs only require that the first element of a vector be indexed
> fast; subsequent elements can stream along with greater latency.
> However, I'm not aware of any recent vector hardware uarch that has
> taken advantage of this possibility. Usually they just build a great
> big wide register file.

And this is needed!

If you check actual SIMD-type code, you'll notice that various forms of
permutation are _very_ common, i.e. you need to rearrange the order of
data in one or more vector registers:
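
For instance (an SSE sketch, just to have something concrete - the
Altivec permute mentioned below is the fully general version of this):

  #include <xmmintrin.h>

  /* reverse the element order of a 4-wide SP register: a single shuffle */
  __m128 reverse4(__m128 v) {
      return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
  }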

If vectors were processed in streaming mode, we would have the same
situation as for the Pentium 4, which did half a register in each half
cycle in its fast core, but had to punt each time you did a right shift
(or any other operation which could not be processed in little-endian
order, least-significant half first).

I once saw a reference to Altivec code that used the in-register
permute operation more than any other opcode.
>
> Vector ISAs are on a slippery slope of ISA complexity. First you have
> vector+vector -> vector ops. Then you add vector sum reductions. Inner
> products. Prefix calculations. Operate under mask. Etc. This slippery
> slope seems much less slippery for CIMT, since most of these operations
> can be synthesized simply out of the scalar operations that are their
> basis.

Except that even scalar code needs prefix/mask type operations in order
to get rid of some branches, right?
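
(The usual trick, sketched with SSE: the compare produces an
all-ones/all-zeros mask per element, and and/andnot/or do the select,
with no branch anywhere. Function name made up:)

  #include <xmmintrin.h>

  /* branch-free r[i] = (a[i] > b[i]) ? x[i] : y[i], 4 elements at a time */
  __m128 select_gt(__m128 a, __m128 b, __m128 x, __m128 y) {
      __m128 m = _mm_cmpgt_ps(a, b);           /* mask per element    */
      return _mm_or_ps(_mm_and_ps(m, x),       /* keep x where true   */
                       _mm_andnot_ps(m, y));   /* keep y where false  */
  }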

All (most of?) the others seem to boil down to a need for a fast vector
permute...
>
> Vector chaining is a source of performance - and complexity. It happens
> somewhat for free with Nvidia style scalar SIMT, and the equivalent of
> more complicated chaining complexes can be set up using ATI/AMD's VLIW
> SIMT.
>
> All this being said, why would I be interested in reviving vector ISAs?
>
> Mainly because vector ISAs allow the cost of instruction decode and
> scheduling to be amortized.
>
> But also because, as I discussed in my Berkeley Parlab presentation of
> Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
> somewhat the deficiencies of coherent threading, specifically the
> problem of divergence.

Please tell!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"