From: nmm1 on
In article <87my56fcj8.fsf(a)ami-cg.GraySage.com>,
Chris Gray <cg(a)graysage.com> wrote:
>nmm1(a)cam.ac.uk writes:
>
>> As you know, I am radical among radicals, but what I should like to
>> see is a 1,024 core chip, with an enhanced 2-D grid memory topology,
>> back-to-back with its NON-shared memory, NO hardware floating-point,
>
>Why do you say no HW float, Nick? You know far more about dealing with
>applications and floating point than I do, but as an ex-Myriod, my
>recollection is that the lack of HW floating point was one of the big
>problems with the first Myrias system (SPS-1, based on MC68000). The SW
>floating point we had was just so slow that it outweighed any other factors
>relating to the system (like cost). It could have been faster if it didn't
>need to stick to IEEE semantics, but I doubt if it would have been an order
>of magnitude faster.

Interestingly enough, part of my reason for saying that was ANOTHER
68000-based system! It implemented floating-point in software, and
achieved comparable performance to the IBM PC/RT's hardware (a low
hurdle, I agree). Way back then, it wouldn't have flown (though
I first suggested it over 20 years ago!). However, the value of
almost every parameter has changed since then.

Firstly, the current approach enshrines a very poor specification
(from the point of view of both RAS and performance) in concrete.
I should like to introduce a bit of flexibility. Actually, all I
really want is to abolish signed zeroes and have division by zero
deliver a NaN! That removes 95% of the gotchas. Oh, and to use
hard underflow, but that's purely for performance.
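
To show what I mean by gotchas, here is a trivial C fragment that
relies on nothing beyond standard IEEE 754 behaviour; the comments
note what would change under the scheme above:

    #include <stdio.h>

    int main(void)
    {
        double pz = 0.0, nz = -0.0;

        /* The two zeroes compare equal but do not behave identically. */
        printf("pz == nz : %d\n", pz == nz);   /* prints 1    */
        printf("1.0/pz   : %g\n", 1.0 / pz);   /* prints inf  */
        printf("1.0/nz   : %g\n", 1.0 / nz);   /* prints -inf */

        /* With no signed zeroes and division by zero delivering a NaN,
           both divisions above would print nan, and the distinction
           between pz and nz would not exist in the first place.       */
        return 0;
    }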

Secondly, on modern processors, there is usually a HELL of a lot of
spare CPU time - the 'CPU' bottlenecks are almost entirely memory
access and glitch recovery. It is a rare application that is
dominated by the floating-point unit, as such, nowadays.

Thirdly, the historically dire performance of software floating-point
is between 50% and 99% (sic) due to the method of providing it, and
not to the emulation itself. By providing a very simple usercode
facility, that overhead essentially vanishes.
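
To make that concrete, compare the two ways of reaching the emulation
routine in the sketch below. The routine itself (soft_fadd is my name,
not anybody's ABI) cheats by using the host FPU, because the point is
the dispatch path and not the arithmetic:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in for the real emulation routine.  It cheats by using the
       host FPU, because the point here is how it is reached, not how
       the arithmetic is done.  (soft_fadd is a name invented for this
       sketch.)                                                         */
    static uint32_t soft_fadd(uint32_t a, uint32_t b)
    {
        float fa, fb, fr;
        uint32_t r;
        memcpy(&fa, &a, sizeof fa);
        memcpy(&fb, &b, sizeof fb);
        fr = fa + fb;
        memcpy(&r, &fr, sizeof r);
        return r;
    }

    int main(void)
    {
        uint32_t x = 0x40490fdbu;   /* ~3.14159f */
        uint32_t y = 0x402df854u;   /* ~2.71828f */

        /* Trap-based scheme: the FP opcode faults, the kernel saves
           state, decodes the instruction, emulates it and returns -
           hundreds of cycles of overhead per operation.
           Usercode scheme: control transfers here directly, so the
           overhead is just a call and return.                          */
        uint32_t z = soft_fadd(x, y);

        float fz;
        memcpy(&fz, &z, sizeof fz);
        printf("%f\n", fz);         /* ~5.859874 */
        return 0;
    }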

Fourthly, between 50% and 75% of the cost of the emulation as such
is due to the fact that the primitives available to software are
unsuitable, although they are fast (and often trivial) in hardware.
Designing such primitives would need care, but they could easily be
provided.
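
Two obvious examples of such primitives, sketched in C with names of
my own invention: a count-leading-zeros and a shift-right that keeps a
sticky bit for rounding. Each is a handful of gates in hardware, but a
loop or a clutch of instructions in portable software:

    #include <stdint.h>
    #include <stdio.h>

    /* Count leading zeros: one instruction in most hardware, a loop
       (or a table) in portable software.                              */
    static int clz32(uint32_t x)
    {
        int n = 0;
        if (x == 0) return 32;
        while (!(x & 0x80000000u)) { x <<= 1; n++; }
        return n;
    }

    /* Shift right by s, ORing every bit shifted out into bit 0 (the
       "sticky" bit needed for correct rounding).  Trivial in hardware,
       clumsy in software.                                              */
    static uint32_t shr_sticky(uint32_t x, int s)
    {
        if (s <= 0) return x;
        if (s >= 32) return x != 0;
        uint32_t lost = x & ((1u << s) - 1u);
        return (x >> s) | (lost != 0);
    }

    int main(void)
    {
        printf("clz32(0x00010000)   = %d\n", clz32(0x00010000u));     /* 15  */
        printf("shr_sticky(0x89, 4) = 0x%x\n", shr_sticky(0x89u, 4)); /* 0x9 */
        return 0;
    }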

Fifthly, by abolishing the separation between the integer and floating-
point pipelines, you get some performance advantages. Much more
importantly, you get a lot more parallel scalability - if you want
to make yourself feel very ill, study the EXACT specifications of
how they are synchronised between threads, not forgetting the cases
when their memory paths and interrupt-based fixups are different.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:
>
> Fourthly, between 50% and 75% of the cost of the emulation as such
> is due to the fact that the primitives available to software are
> unsuitable, although they are fast (and often trivial) in hardware.
> Designing such primitives would need care, but they could easily be
> provided.
>
> Fifthly, by abolishing the separation between the integer and floating-
> point pipelines, you get some performance advantages. Much more
> importantly, you get a lot more parallel scalability - if you want
> to make yourself feel very ill, study the EXACT specifications of
> how they are synchronised between threads, not forgetting the cases
> when their memory paths and interrupt-based fixups are different.
>
>

Let's look at the impact of implementing software FP, augmented by the
necessary HW support.

1. You give up 1/2 the registers. Typically FP instructions (implicitly)
use a different register set. This increases the number of names
available to a compiler, while reducing the orthogonality in the ISA.
What would you rather have: 32 registers that can be used by both FP
and integer ops, or 32 registers that can be used by FP and 32
registers that can be used by integer ops, with an additional cost to
transfer between them?

2. Can you live with 32 bits of mantissa + exponent or do you want 64?
If you want 64, are the integer registers also 64 bit? Or will you use
register pairs for floating-point emulation? Note that a 64 bit add vs a
32 bit add costs you about 20-40% in cycle time. (A sketch of the
register-pair case is at the end of this post.)

3. What width multiplier do you want? Most integer apps only need 32x32
which can be done in 2-3 arithmetic 32-bit add equivalents. A true
64x64 is more like 4-5 stages. A 48x48 multiplier is probably 3-4 stages.

4. If you support pipelines of different lengths, you have the same
synchronization issues; it doesn't matter whether they are all integer
or not. So most processors with both a multiply and an add are going
to have the synch issue anyway.

5. You're going to have normalization instructions anyway. You may get
away with not using them between sequences of multiplies, but once you
get to adding, then you'll be normalizing after each op. You don't
necessarily get any wins by breaking them out.

In other words, if you're going to extend the instruction set to
usefully support FP, you might as well go the whole hog and truly
support FP.

BTW: none of this requires that FP and integer can't share the data
path, nor does it imply anything about superscalar/OoO implementations.
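
To put some flesh on point 2: here is roughly what a 64-bit add looks
like when it has to be built from 32-bit register pairs (the C below
is only a sketch; widths and names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* 64-bit add on a 32-bit datapath: two 32-bit adds plus an explicit
       carry, i.e. three dependent operations instead of one.  With an
       add-with-carry instruction it collapses back to two.             */
    static void add64(uint32_t alo, uint32_t ahi,
                      uint32_t blo, uint32_t bhi,
                      uint32_t *rlo, uint32_t *rhi)
    {
        uint32_t lo = alo + blo;
        uint32_t carry = (lo < alo);      /* unsigned overflow test */
        *rlo = lo;
        *rhi = ahi + bhi + carry;
    }

    int main(void)
    {
        uint32_t lo, hi;
        add64(0xffffffffu, 0x00000001u, 0x00000001u, 0x00000002u, &lo, &hi);
        printf("0x%08x%08x\n", hi, lo);   /* 0x0000000400000000 */
        return 0;
    }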
From: nmm1 on
In article <JY6dnaLYurTZJDvXnZ2dnUVZ_tqdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>Let's look at the impact of implementing software FP, augmented by the
>necessary HW support.
>
>1. You give up 1/2 the registers. Typically FP instructions (implicitly)
>use a different register set. ...

And exactly why can't you continue to do that?

>What would you rather have: 32 registers that can be used by both FP
>and integer ops, or 32 registers that can be used by FP and 32
>registers that can be used by integer ops, with an additional cost to
>transfer between them?

Why not 64 that can be used for either?

>2. Can you live with 32 bits of mantissa + exponent or do you want 64?
>If you want 64, are the integer registers also 64 bit? Or will you use
>register pairs for floating-point emulation? Note that a 64 bit add vs a
>32 bit add costs you about 20-40% in cycle time.

Except on embedded designs, ALL modern CPUs use 64 bit integer
registers. What decade are you living in?

Furthermore, operations using multiple sequential registers (they
don't have to be pairs) aren't a problem if you have enough registers.
Nor has it been for 40+ years. I can't speak for the 1950s.
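
In software terms, multiple sequential registers just means the classic
limb-at-a-time loop; a sketch, with the array of limbs standing in for
a run of registers (the code is illustrative, not any particular ISA):

    #include <stdint.h>
    #include <stdio.h>

    /* Multiple-precision add: each 32-bit limb stands in for one of a
       run of sequential registers.  One add-with-carry per limb; no
       register pairing is needed, just enough registers.              */
    static void mp_add(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
    {
        uint32_t carry = 0;
        for (int i = 0; i < n; i++) {       /* limb 0 is least significant */
            uint32_t s = a[i] + b[i];
            uint32_t c = (s < a[i]);        /* carry out of a[i] + b[i]    */
            r[i] = s + carry;
            carry = c | (r[i] < s);         /* carry out of the whole limb */
        }
    }

    int main(void)
    {
        /* (2^96 - 1) + 1 = 2^96: the carry ripples through three limbs. */
        uint32_t a[4] = { 0xffffffffu, 0xffffffffu, 0xffffffffu, 0 };
        uint32_t b[4] = { 1, 0, 0, 0 };
        uint32_t r[4];
        mp_add(r, a, b, 4);
        printf("%08x %08x %08x %08x\n", r[3], r[2], r[1], r[0]);
        /* expect 00000001 00000000 00000000 00000000 */
        return 0;
    }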

>3. What width multiplier do you want? Most integer apps only need 32x32
> which can be done in 2-3 arithmetic 32-bit add equivalents. A true
>64x64 is more like 4-5 stages. A 48x48 multiplier is probably 3-4 stages.

Therein hangs a tale .... A full answer is too long, but consider
32x32=>32, with selectable shifting and separate top and bottom
instructions. VERY useful for integer work, too. A full 64x64
multiply is then only 4 of those. One could even consider having
only 16x16=>32, too.

Remember that, if you design it right, and make the multiplication
a multiply-and-accumulate, it isn't all that many instructions and
is pipelinable/parallelisable up to the eyeballs.
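
To spell out the "only 4 of those": given 32x32 products (the sketch
below uses 32x32=>64 for brevity; separate top and bottom results come
to the same thing), the full 64x64 product is four independent
multiplies plus a little addition, which is why it pipelines so well:

    #include <stdint.h>
    #include <stdio.h>

    /* 64x64 -> 128 from four 32x32 -> 64 partial products.  Each of the
       four multiplies is independent, so the sequence pipelines well,
       and folding an accumulator into it gives multiply-and-accumulate. */
    static void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
    {
        uint64_t al = (uint32_t)a, ah = a >> 32;
        uint64_t bl = (uint32_t)b, bh = b >> 32;

        uint64_t ll = al * bl;            /* bottom */
        uint64_t lh = al * bh;            /* middle */
        uint64_t hl = ah * bl;            /* middle */
        uint64_t hh = ah * bh;            /* top    */

        uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
        *lo = (mid << 32) | (uint32_t)ll;
        *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    }

    int main(void)
    {
        uint64_t hi, lo;
        mul64(0xffffffffffffffffULL, 0xffffffffffffffffULL, &hi, &lo);
        printf("%016llx %016llx\n",
               (unsigned long long)hi, (unsigned long long)lo);
        /* expect fffffffffffffffe 0000000000000001 */
        return 0;
    }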

>4. If you support pipelines of different lengths, you have the same
>synchronization issues; it doesn't matter whether they are all integer
>or not. So most processors with both a multiply and an add are going
>to have the synch issue anyway.

If you mean the synchronisation issues that I was referring to,
the length of the pipeline is largely irrelevant; they arise from
different causes.

>5. You're going to have normalization instructions anyway. You may get
>away with not using them between sequences of multiplies, but once you
>get to adding, then you'll be normalizing after each op. You don't
>necessarily get any wins by breaking them out.

Oh, really? Not only does that not apply when writing numerical
functions (which are often where the time goes), but there are also a
fair number of established techniques for avoiding it. Of course,
they mostly date from the discrete logic era :-)
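
One of the oldest such techniques, sketched below: keep a sum of
products at a common (block) exponent in a wide fixed-point
accumulator and normalise once at the end, instead of after every
add. The unpacked format and widths are mine, purely for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative unpacked operand: value = frac * 2^exp, frac < 2^16.
       Positive values only, to keep the sketch short.                  */
    struct uf { int exp; uint32_t frac; };

    /* Sum of products with ONE normalization at the end.  Every product
       is shifted to a common exponent and added into a wide (64-bit)
       fixed-point accumulator; only the final result is renormalized.  */
    static struct uf dot(const struct uf *a, const struct uf *b, int n)
    {
        int base = a[0].exp + b[0].exp;      /* common (block) exponent  */
        int64_t acc = 0;
        for (int i = 0; i < n; i++) {
            int64_t p = (int64_t)a[i].frac * b[i].frac;
            int d = (a[i].exp + b[i].exp) - base;
            acc += (d >= 0) ? (p << d) : (p >> -d);
        }
        struct uf r = { base, 0 };
        while (acc >= (1 << 16)) { acc >>= 1; r.exp++; }  /* one normalize */
        r.frac = (uint32_t)acc;
        return r;
    }

    int main(void)
    {
        struct uf a[2] = { {0, 30000}, {1, 20000} };   /* 30000, 40000 */
        struct uf b[2] = { {0, 30000}, {0, 10000} };   /* 30000, 10000 */
        struct uf r = dot(a, b, 2);                    /* 9e8 + 4e8    */
        printf("%u * 2^%d\n", r.frac, r.exp);          /* 39672 * 2^15 ~= 1.3e9 */
        return 0;
    }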

>In other words, if you're going to extend the instruction set to
>usefully support FP, you might as well go the whole hog and truly
>support FP.

Only if you are thinking within the box of modern RISC/x86-like
designs (and, yes, they are more similar than not).


Regards,
Nick Maclaren.
From: Mayan Moudgill on

Hmmm.... Nick, you started off by wanting:
> but what I should like to
> see is a 1,024 core chip, with an enhanced 2-D grid memory topology,
> back-to-back with its NON-shared memory, NO hardware floating-point,
> first-class support for emulation and virtual architectures, and so
> on.

That means that the core you want has to be small. Assuming a
12x12 mm die, this works out to a 0.35x0.35 mm core. That's not a lot
of gates, given that you probably want a lot of memory. You're going
to have to simplify the structure of the core _a_lot_ to get your 1K
cores.

Now, given that, do you really want:
1. 64b registers?
2. multi-port register files?
3. fat instruction sets?

Also, do you really think most people know enough to play fixed
point/block floating point games? Or is this going to be a BLAS only
machine? I suspect you're going to end up with a vanishingly small
number of programmers if your "floating point" can only be made
efficient with programmer intervention, rather than through the compiler.
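
For anyone who has not played those games, here is roughly what block
floating point amounts to: one shared exponent per block, chosen from
the largest element, with everything else losing bits to match. The
sketch below is illustrative only, and the way the small element gets
crushed is exactly the sort of trap I mean:

    #include <stdint.h>
    #include <stdio.h>

    /* Block floating point: one exponent shared by a whole block, with
       each element stored as a fixed-point fraction scaled by it.  The
       16-bit fraction width is purely illustrative; an arithmetic right
       shift is assumed for negative values.                            */
    static int block_scale(const int32_t *x, int16_t *frac, int n)
    {
        int32_t maxmag = 1;                   /* avoid a degenerate exponent */
        for (int i = 0; i < n; i++) {
            int32_t m = x[i] < 0 ? -x[i] : x[i];
            if (m > maxmag) maxmag = m;
        }
        int exp = 0;                          /* shift so max fits in 15 bits */
        while ((maxmag >> exp) >= (1 << 15)) exp++;
        for (int i = 0; i < n; i++)
            frac[i] = (int16_t)(x[i] >> exp); /* value ~= frac[i] * 2^exp */
        return exp;
    }

    int main(void)
    {
        int32_t x[4] = { 70000, -3, 40000, 123456 };
        int16_t f[4];
        int e = block_scale(x, f, 4);
        /* exp=2, and the -3 has already collapsed to -1 */
        printf("exp=%d fracs=%d %d %d %d\n", e, f[0], f[1], f[2], f[3]);
        return 0;
    }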
From: nmm1 on
In article <sOmdnd6erZIJbjvXnZ2dnUVZ_h2dnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>Hmmm.... Nick, you started off by wanting:
>> but what I should like to
>> see is a 1,024 core chip, with an enhanced 2-D grid memory topology,
>> back-to-back with its NON-shared memory, NO hardware floating-point,
>> first-class support for emulation and virtual architectures, and so
>> on.
>
>That means that the core you want has to be small. Assuming a
>12x12 mm die, this works out to a 0.35x0.35 mm core. That's not a lot
>of gates, given that you probably want a lot of memory. You're going
>to have to simplify the structure of the core _a_lot_ to get your 1K
>cores.

Yes, obviously. KISS.

>Now, given that, do you really want:
>1. 64b registers?

Not necessarily, at least today. But, if not, GOOD multiple-register
handling and the ability to code multiple-precision arithmetic are
essential. There's more than one way to skin a cat.

>2. multi-port register files?

As little as possible. KISS.

>3. fat instruction sets?

I have been a supporter of RISC for 35 years. KISS.

>Also, do you really think most people know enough to play fixed
>point/block floating point games? Or is this going to be a BLAS only
>machine? I suspect you're going to end up with a vanishingly small
>number of programmers if your "floating point" can only be made
>efficient with programmer intervention, rather than through the compiler.

Don't be silly. Firstly, you can optimise these aspects as easily
as you can optimise register usage. Secondly, you have missed what
I said about usercode (extracodes, whatever); there is no problem
in providing quite complicated API instructions (e.g. complex
multiply and divide) that way.
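
To make the usercode point concrete: the whole of a complex multiply
is the routine below, and that is exactly the sort of thing an
extracode/usercode entry would present as a single instruction - the
compiler calls it, and the programmer never sees it. (Host doubles are
used here only to keep the sketch short; in the real thing the
arithmetic would be the software FP discussed above.)

    #include <stdio.h>

    /* The kind of routine a usercode/extracode entry point would expose
       as a single "API instruction": complex multiply.                 */
    struct cplx { double re, im; };

    static struct cplx cmul(struct cplx a, struct cplx b)
    {
        struct cplx r;
        r.re = a.re * b.re - a.im * b.im;
        r.im = a.re * b.im + a.im * b.re;
        return r;
    }

    int main(void)
    {
        struct cplx a = { 1.0, 2.0 }, b = { 3.0, -4.0 };
        struct cplx r = cmul(a, b);          /* (1+2i)(3-4i) = 11 + 2i */
        printf("%g + %gi\n", r.re, r.im);
        return 0;
    }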

Your problem is that you aren't aware of the range of technologies
that have been developed and proven to work, and assume that anything
you don't know about doesn't exist. A lot of what I am proposing
dates from a long way back, and lost out to IBM marketing and its
dominance of process and production. Intel? Who they?


Regards,
Nick Maclaren.