From: George Neuner on
On Fri, 23 Jul 2010 11:30:32 -0700 (PDT), Robert Myers
<rbmyersusa(a)gmail.com> wrote:

>I don't think the hardware and software issues can be separated.

Thank goodness someone said that.

At least where HPC is concerned, I've been convinced for some time
that we are fighting hardware rather than leveraging it. I've spent a
number of years with DSPs and FPGAs and I've come to believe that we
(or at least compilers) need to be deliberately programming memory
interfaces as well as the ALU/FPU operations.

The problem most often cited for vector units is that they need to
support non-consecutive and non-uniform striding to be useful. I
agree that there *does* need to be support for those features, but I
believe it should be in the memory subsystem rather than in the
processor.

I'm supposing that there are vector registers accessible to
scatter/gather DMA, and further supposing that there are several DMA
channels - ideally one channel per register. The programmer's indexing
loop code is compiled into instructions that program the DMA to
read/gather a block of operands into the source registers, execute the
vector operation(s), and finally write/scatter the results back to
memory.

I do understand that problems have to have "crystalline" access
patterns and enough long vector(izable) operations to absorb the
latency of data staging. I know there are plenty of problems that
don't fit that model.

The main issue would be having a main memory that could tolerate
concurrent DMA - but I know that lots of things are possible with
appropriate design: I once worked with a system that had a
proprietary FPGA-based memory controller sustaining 1400 MB/s - 700
in and 700 out - using banked 100MHz SDRAM (the old kind, not DDR).

I used to have a 40MHz ADI SHARC 21060 (120 MFlops sustained) on a
bus-mastering PCI board in a 450MHz Pentium II desktop. I had a number
of programs that turned the DSP into a long vector processor (512- to
4096-element "registers") and used overlapped DMA to move data in and
out while processing. Given a large enough data set, that 40MHz DSP
could handily outperform the host's 450MHz CPU.

George
From: nmm1 on
In article <sdtk4654pheq6292135jd42oagr5ov7cg4(a)4ax.com>,
George Neuner <gneuner2(a)comcast.net> wrote:
>On Fri, 23 Jul 2010 11:30:32 -0700 (PDT), Robert Myers
><rbmyersusa(a)gmail.com> wrote:
>
>>I don't think the hardware and software issues can be separated.
>
>Thank goodness someone said that.
>
>At least where HPC is concerned, I've been convinced for some time
>that we are fighting hardware rather than leveraging it. I've spent a
>number of years with DSPs and FPGAs and I've come to believe that we
>(or at least compilers) need to be deliberately programming memory
>interfaces as well as the ALU/FPU operations.
>
>The problem most often cited for vector units is that they need to
>support non-consecutive and non-uniform striding to be useful. I
>agree that there *does* need to be support for those features, but I
>believe it should be in the memory subsystem rather than in the
>processor.

I believe that you have taken the first step on the path to True
Enlightenment, but need to have the courage of your convictions
and proceed further on :-)

I.e. I agree, and what we need is architectures which are designed
to provide data management first and foremost, and which attach
the computation onto that. I.e. turn the traditional approach on
its head. And I don't think that is limited to HPC, either.
I can't see any of the decent computer architects having any great
problem with this concept, but I doubt that the benchmarketers and
execudroids would swallow it.

It would also need a comparable revolution in programming languages
and paradigms, though there have been a lot of exploratory ones that
show the concepts are viable.


Regards,
Nick Maclaren.
From: nmm1 on
In article <cveqh7-ad2.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>If you check actual SIMD-type code, you'll notice that various forms of
>permutation are _very_ common, i.e. you need to rearrange the order of
>data in one or more vector registers:
>
>If vectors were processed in streaming mode, we would have the same
>situation as for the Pentium 4, which did half a register in each half
>cycle in the fast core but had to punt each time you did a right shift
>(or any other operation which could not be processed in LE order).
>
>I once saw a reference to AltiVec code that used the in-register
>permute operation more than any other opcode.
>
>Except that even scalar code needs prefix/mask type operations in order
>to get rid of some branches, right?
>
>All (most of?) the others seem to boil down to a need for a fast vector
>permute...

Yes. My limited investigations indicated that the viability of
vector systems usually boiled down to whether the hardware's ability
to do that was enough to meet the software's requirements. If not,
it spent most of its time in scalar code.

>> But also because, as I discussed in my Berkeley Parlab presentation of
>> Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
>> somewhat the deficiencies of coherent threading, specifically the
>> problem of divergence.
>
>Please tell!

Indeed, yes, please do!


Regards,
Nick Maclaren.
From: Robert Myers on
On Jul 24, 6:01 am, n...(a)cam.ac.uk wrote:
> In article <sdtk4654pheq6292135jd42oagr5ov7...(a)4ax.com>,
> George Neuner  <gneun...(a)comcast.net> wrote:
>
>
> >At least where HPC is concerned, I've been convinced for some time
> >that we are fighting hardware rather than leveraging it.  I've spent a
> >number of years with DSPs and FPGAs and I've come to believe that we
> >(or at least compilers) need to be deliberately programming memory
> >interfaces as well as the ALU/FPU operations.
>
> >The problem most often cited for vector units is that they need to
> >support non-consecutive and non-uniform striding to be useful.  I
> >agree that there *does* need to be support for those features, but I
> >believe it should be in the memory subsystem rather than in the
> >processor.
>
> I believe that you have taken the first step on the path to True
> Enlightenment, but need to have the courage of your convictions
> and proceed further on :-)
>
> I.e. I agree, and what we need is architectures which are designed
> to provide data management first and foremost, and which attach
> the computation onto that.  I.e. turn the traditional approach on
> its head.  And I don't think that is limited to HPC, either.
> I can't see any of the decent computer architects having any great
> problem with this concept, but I doubt that the benchmarketers and
> execudroids would swallow it.

Ok. So here's a half-baked guess.

The reason that doesn't happen isn't to be found in the corner office,
but in your thread about RDMA and Andy's comments in that thread, in
particular.

Today's computers are *not* designed around computation, but around
coherent cache. Now that the memory controller is on the die, the
takeover is complete. Nothing moves efficiently without the notice,
and often the unnecessary involvement, of the real von Neumann
bottleneck: the cache.

Cache snooping is the one ring that rules them all.

I doubt if an implausible journey through Middle Earth by fantastic
creatures would help, but probably some similarly wild exercise of the
imagination is called for.

Currently, you cluster processors when you can't conveniently jam them
all into a single coherence domain. The multiple coherence domains
that result are an annoyance to someone like me who would desperately
like to think in terms of one big, flat memory space, but they also
allow new possibilities, like moving data around without bothering
other processors and other coherence domains. Maybe you want multiple
coherence domains even when you aren't forced into it by the size of a
board or a rack or a mainframe.

Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrest some control away from the hardware and give it to the
software.

Robert.

From: nmm1 on
In article <88d23585-d47c-47af-91a1-7bae764afaf8(a)q22g2000yqm.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
>Today's computers are *not* designed around computation, but around
>coherent cache. Now that the memory controller is on the die, the
>takeover is complete. Nothing moves efficiently without the notice,
>and often the unnecessary involvement, of the real von Neumann
>bottleneck: the cache.

Yes and no. Their interfaces are still designed around computation,
and the coherent cache is designed to give the impression that
programmers need not concern themselves with programming memory
access - it's all transparent.

>Maybe you want more programmable control over coherence domains. If
>you're not going to scrap cache and cache snooping, maybe you can
>wrest some control away from the hardware and give it to the
>software.

That is, indeed, part of what I do mean.


Regards,
Nick Maclaren.