From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:


> Your problem is that you aren't aware of the range of technologies
> that have been developed and proven to work, and assume that anything
> you don't know about doesn't exist. A lot of what I am proposing
> dates from a long way back, and lost out to IBM marketing and its
> dominance of process and production. Intel? Who they?
>
Enlighten us. How would you implement software fp efficiently? In
particular, what code would you have us generate for

for( i = 0; i < N; i++ ) {
    Z[i] = A[i]*B[i] - C[i];
    if( Z[i] < D[i] ) {
        Z[i] = Z[i] + E[i];
    }
}
(I picked a loop with a multiply, a subtract, an add and a compare.)

Feel free to specify non-IEEE formats, including 32bit exponent and 32
bit signed mantissa (i.e. use a 32 bit register for exponent and a 32
bit register for mantissa).
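
As a concrete sketch of that suggested split format, assuming a plain C
struct whose field names happen to match the .mant/.exp accesses Terje
uses further down (nothing beyond the 32+32 split is specified here):

#include <stdint.h>

/* Sketch only: 32-bit signed mantissa and 32-bit exponent, each in its
 * own register-sized word, as suggested above. */
typedef struct {
    int32_t mant;   /* signed mantissa */
    int32_t exp;    /* signed exponent */
} sfp;

/* The arrays in the challenge loop would then be declared as: */
extern sfp A[], B[], C[], D[], E[], Z[];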
From: nmm1 on
In article <OMOdne9eFPwDSjrXnZ2dnUVZ_tGdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>> Your problem is that you aren't aware of the range of technologies
>> that have been developed and proven to work, and assume that anything
>> you don't know about doesn't exist. A lot of what I am proposing
>> dates from a long way back, and lost out to IBM marketing and its
>> dominance of process and production. Intel? Who they?
>>
>Enlighten us. How would you implement software fp efficiently? In
>particular, what code would you have us generate for
>
> for( i = 0; i < N; i++ ) {
> Z[i] = A[i]*B[i] - C[i];
> if( Z[i] < D[i] ) {
> Z[i] = Z[i] + E[i];
> }
> }
>(I picked a loop with a multiply, a subtract, an add and a compare )
>
>Feel free to specify non-IEEE formats, including 32bit exponent and 32
>bit signed mantissa (i.e. use a 32 bit register for exponent and a 32
>bit register for mantissa).

I am beginning to wonder whether you are posting seriously, or
beginning to troll. In the hope that you are not, I shall respond.

Firstly, I never denied that there was SOME loss of efficiency,
but said that the losses were offset by gains with code that is
currently a problem and need not be. You have chosen an example
that provides minimal opportunities for the gains.

Secondly, by breaking the floating-point operations down into
multiple stages, you provide much more opportunity for the compiler
to optimise for multiple issue, pipelining, parallelism and even
VLIW. Again, you have chosen an example that doesn't expose that
serious problem with existing architectures.

Thirdly, the only fundamental difference between a hardware and
software approach is that the former is invoked by a single ABI
instruction and the latter is not. You may not remember the days
when floating-point was often implemented in firmware, but those
systems had fast floating-point.

The last is perhaps the key. You optimise software floating-point
in almost exactly the same way that you optimise firmware floating-
point, which is very similar (and often identical) to the way that
you optimise multi-stage hardware floating-point. The ONLY real
difference is that the granularity is finer, and you are exposing
the stages of the floating-point operation.
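
As an illustration of what exposing those stages could mean, here is a
sketch of an addition broken into align, integer-add and renormalise
steps; the 32+32 format and the stage boundaries are assumptions for
illustration, not a specification of the proposal:

#include <stdint.h>

/* Each stage below is intended to look like an ordinary integer
 * instruction that the compiler can schedule freely. */
typedef struct { int32_t mant; int32_t exp; } sfp;

static sfp fp_add_stages(sfp a, sfp b)
{
    /* Stage 1: align. Make 'a' the operand with the larger exponent and
     * shift the other mantissa right by the exponent difference
     * (assumes an arithmetic right shift). */
    if (a.exp < b.exp) { sfp t = a; a = b; b = t; }
    int32_t d  = a.exp - b.exp;
    int32_t bm = (d >= 32) ? 0 : (b.mant >> d);

    /* Stage 2: plain integer add of the aligned mantissas. */
    int64_t sum = (int64_t)a.mant + bm;

    /* Stage 3: renormalise the result back into 32 bits. */
    int32_t e = a.exp;
    while (sum > INT32_MAX || sum < INT32_MIN) { sum >>= 1; e++; }

    sfp r = { (int32_t)sum, e };
    return r;
}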

Think of my proposal as firmware revisited. You may understand it
better. And, if you think that firmware was necessarily slow,
you haven't been keeping up. Even Seymour Cray couldn't get more
than a factor of about two, and his cavalier attitude to accuracy
accounted for as much of his speed gains as his use of hardware.


Before you can ask how I can optimise it, you need to explain why
doing that is necessarily much slower than hiding the stages behind
a single instruction. Because those stages assuredly exist, and
are necessarily serialised, in most floating-point unit designs.
Though a lot of the complexity of floating-point hardware is to
try to hide that.

Take a look at the number of multiple-issue and VLIW designs that
had and have restrictions on the number of instructions that can
be floating-point, and the mind-blowing length of some of the
pipelines. It wasn't and isn't easy to get those to run at full
speed, when faced with an arbitrary set of code.

No, my proposal doesn't solve that. If it did, it would have been
adopted. But that problem is the reason that exposing the underlying
stages is feasible without as much loss of efficiency as you think.
And remember that it isn't rare for an HPC system to spend only 2-5%
of its 'CPU time' actually executing arithmetic instructions. Yes,
really. So there is a hell of a lot of slack.


Now, if you start to bring in the DSP and similar designs that ARE
dominated by the arithmetic unit, you will first have to explain why
they need floating-point in the first place. And remember that,
while I did very little of it, I came in at the end of the period
when using fixed-point for heavyweight numerics was common.


Regards,
Nick Maclaren.
From: Terje Mathisen on
Mayan Moudgill wrote:
> nmm1(a)cam.ac.uk wrote:
>
>
>> Your problem is that you aren't aware of the range of technologies
>> that have been developed and proven to work, and assume that anything
>> you don't know about doesn't exist. A lot of what I am proposing
>> dates from a long way back, and lost out to IBM marketing and its
>> dominance of process and production. Intel? Who they?
>>
> Enlighten us. How would you implement software fp efficiently? In
> particular, what code would you have us generate for
>
> for( i = 0; i < N; i++ ) {
> Z[i] = A[i]*B[i] - C[i];
> if( Z[i] < D[i] ) {
> Z[i] = Z[i] + E[i];
> }
> }
> (I picked a loop with a multiply, a subtract, an add and a compare )
>
> Feel free to specify non-IEEE formats, including 32bit exponent and 32
> bit signed mantissa (i.e. use a 32 bit register for exponent and a 32
> bit register for mantissa).

With sufficient integer registers available, the code above can run very
fast indeed, since each iteration is independent of the previous.

I.e. you (or I) can pipeline it as much as the register set allows.

Since Nick specifies that the critical building blocks should be
exposed, I'll assume I have a normalize which returns the number of bits
shifted (positive or negative):

a_mant = a[i].mant; a_exp = a[i].exp;
b_mant = b[i].mant; b_exp = b[i].exp;
c_mant = c[i].mant; c_exp = c[i].exp;
d_mant = d[i].mant; d_exp = d[i].exp;
e_mant = e[i].mant; e_exp = e[i].exp;

// FMUL, double-wide non-normalized result:
mant2 m = (mant2) a_mant * (mant2) b_mant;
e = a_exp + b_exp;

// FSUB
expdiff = e - c_exp;
m_adj = max(0, -expdiff);
// Shifts larger than the mantissa length will always return 0
m >>= m_adj; e += m_adj;
c_adj = max(0, expdiff);
c_mant >>= c_adj;
m -= c_mant;

// Assume the compare will be OK, so calculate Z+E
// FADD
expdiff = e - e_exp;
m_adj = max(0, -expdiff);
zm = m >> m_adj; ze = e + m_adj;
e_adj = max(0, expdiff);
e_mant >>= e_adj;
zm += e_mant;

// Normalize Z (e,m) in order to do the compare:
e -= normalize(m);

// Z < D means we keep the speculative Z+E computed above
smaller = (e < d_exp) || ((e == d_exp) && (m < d_mant));

cmov(m, zm, smaller);
cmov(e, ze, smaller);

etc., for a total of ~30 operations.
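
For reference, the normalize() assumed at the top could be sketched
along these lines, using GCC/Clang's __builtin_clzll; the exact target
bit position is a guess, not something Terje specifies:

#include <stdint.h>

/* Sketch of normalize(): shift the double-wide mantissa so that its top
 * magnitude bit sits just below the sign bit, and return the shift
 * count (positive = shifted left, negative = shifted right), so the
 * caller can do  e -= normalize(&m);                                  */
static int normalize(int64_t *mant)
{
    if (*mant == 0)
        return 0;                               /* leave true zero alone */
    uint64_t mag = (*mant < 0) ? 0u - (uint64_t)*mant : (uint64_t)*mant;
    int shift = __builtin_clzll(mag) - 1;       /* aim magnitude at bit 62 */
    if (shift >= 0)
        *mant = (int64_t)((uint64_t)*mant << shift);
    else
        *mant >>= -shift;                       /* assumes arithmetic shift */
    return shift;
}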

The key is that the latency of a single iteration is pretty bad, and
there's no attempt at IEEE rounding, and my sign handling is
non-existent (add a couple more building block opcodes), but with enough
execution units an arbitrary number of these operations can run in parallel.

With Nick's 1K cores, each with 32 or 64 registers, it seems like you
could get ~32 results every cycle, with a single execution unit in each
core. (1024 cores each issuing one of the ~30 operations per cycle comes
to roughly 32 completed iterations per cycle.)

OTOH, having just slightly more complicated building blocks would double
this, for a real win in power efficiency.

I.e. I don't really believe Nick is right:

FP math has such large usage that it does make sense to dedicate
specialized hardware for it, although many (most these days?) can make
do with well below IEEE float standard implementations: I'm thinking of
all the flops taken by the world's GPU population.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <YIydnbaWAq0glTXXnZ2dnUVZ8kednZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>etc., for a total of ~30 operations.
>
>The key is that the latency of a single iteration is pretty bad, and
>there's no attempt at IEEE rounding, and my sign handling is
>non-existent (add a couple more building block opcodes), but with enough
>execution units an arbitrary number of these operations can run in parallel.

Yes, but you are using existing software primitives, and that is NOT
what I am proposing!

>With Nick's 1K cores, each with 32 or 64 registers, it seems like you
>could get ~32 results every cycle, with a single execution unit in each
>core.
>
>OTOH, having just slightly more complicated building blocks would double
>this, for a real win in power efficiency.

And THAT is far closer to what I am proposing. My estimate is that
the right scale of division is to split floating-point operations
into 5-10 steps, many of which would be very useful for implementing
numeric functions and/or extended precision integers.

>I.e. I don't really believe Nick is right:
>
>FP math has such large usage that it does make sense to dedicate
>specialized hardware for it, although many (most these days?) can make
>do with well below IEEE float standard implementations: I'm thinking of
>all the flops taken by the world's GPU population.

Yes, but remember that those don't need floating-point in the first
place! Almost all GPU algorithms are fixed-point ones, implemented
in floating-point because so few people understand fixed-point
numerics nowadays.

Realistically, it's that aspect that kills my idea, not the actual
architectural ones. It doesn't matter how good the engineering is
if the customers won't adopt it.


Regards,
Nick Maclaren.
From: Terje Mathisen on
nmm1(a)cam.ac.uk wrote:
>> FP math has such large usage that it does make sense to dedicate
>> specialized hardware for it, although many (most these days?) can make
>> do with well below IEEE float standard implementations: I'm thinking of
>> all the flops taken by the world's GPU population.
>
> Yes, but remember that those don't need floating-point in the first
> place! Almost all GPU algorithms are fixed-point ones, implemented

Not any more. GPUs switched to using fp internally _before_ they ever
exposed this to the end users/developers, simply because it made sense
for the GPU vendors, and they already had working fixed-point firmware
for previous generations of cards.

> in floating-point because so few people understand fixed-point
> numerics nowadays.

All the texture access&filtering, which is responsible for the majority
of all the flops, is defined using texture quads surrounding the fp
coordinates of the sampling point.

Using this approach avoids the need for a division per pixel (or small
group of pixels, as in the original Quake), so it is a real win as long
as you have enough hardware to handle all four coordinates simultaneously.

Yes, this _could_ have been handled with fixed-point, but since you need
10 bits of fractional precision, you'd still end up with 32-bit chunks,
and you'd have to be very careful to avoid overflows.
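
To make the overflow point concrete, here is a sketch using an invented
22.10 fixed-point layout (10 fractional bits, as estimated above); this
is not any particular GPU's format:

#include <stdint.h>

/* Illustrative only: a texture coordinate as 22.10 fixed point. */
typedef int32_t fix22_10;
#define FIX_ONE (1 << 10)                /* 1.0 in 22.10 */

/* Multiplying two coordinates yields up to ~44 significant bits with
 * 20 fractional bits, so a 64-bit intermediate is needed before
 * shifting back down; fp spares you exactly this kind of care. */
static fix22_10 fix_mul(fix22_10 a, fix22_10 b)
{
    return (fix22_10)(((int64_t)a * b) >> 10);
}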

Another place where fp really helps is when doing gamma-corrected
blending/sampling. I'm guessing anisotropic filtering is in the same group.

> Realistically, it's that aspect that kills my idea, not the actual
> architectural ones. It doesn't matter how good the engineering is
> if the customers won't adopt it.

I believe you could indeed make a
'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
dedicated hw, but you would have to make sure that _nobody_ except the
compiler writers ever needed to be exposed to it.
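
One possible reading of that four-step multiply, sketched under the same
32-bit mantissa / 32-bit exponent assumption (the step names are Terje's;
the bodies are guesswork):

#include <stdint.h>

typedef struct { int32_t mant; int32_t exp; } sfp;          /* sketch format  */
typedef struct { int64_t prod; int32_t exp; } sfp_partial;  /* pipeline state */

/* multiply_setup: add the exponents and stage the mantissas. */
static sfp_partial multiply_setup(sfp a, sfp b)
{
    sfp_partial p;
    p.exp  = a.exp + b.exp;
    /* mul_core1/mul_core2: on a narrow datapath the 32x32 product would
     * be built from two partial products; here it is a single multiply. */
    p.prod = (int64_t)a.mant * b.mant;
    return p;
}

/* mul_normalize: bring the double-wide product back into the format. */
static sfp mul_normalize(sfp_partial p)
{
    while (p.prod > INT32_MAX || p.prod < INT32_MIN) {
        p.prod >>= 1;                   /* assumes arithmetic right shift */
        p.exp++;
    }
    sfp r = { (int32_t)p.prod, p.exp };
    return r;
}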

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"