From: ChrisQ on
nmm1(a)cam.ac.uk wrote:

> Actually, things are getting worse. The problem is that floating-point
> is increasingly being interpreted as IEEE 754, including every frob,
> gizmo and brass knob. And the new version now specifies decimal; if
> that takes off, there will be pressure to provide that, often as well
> as binary - and there are two variants of decimal, too!
>
> IBM say that it adds only 5% to the amount of logic they need, but they
> have a huge floating-point unit in the POWER series. In small chips,
> designed for embedding, it's a massive overhead (perhaps a factor of
> two for binary and three for decimal?) I should appreciate references
> to any hard, detailed information on this.
>
> What is needed is a simplified IEEE 754 binary floating-point, which
> would need less logic, be faster and have better RAS properties. It
> wouldn't even be hard to do - it's been done, many times :-(
>

The last thing I need cluttering up an embedded cpu is floating point
capability. Any math is done fixed point here and then translated to and
from the external world. It's the only way to be confident about the
accuracy. I don't really trust the standard C lib anyway, and am even
less likely to trust the float lib, where the sources, even when
available, are probably untidy, uncommented, cryptic and thus opaque :-)...

Regards,

Chris
From: ChrisQ on
Bernd Paysan wrote:

> Indeed, e.g. LyX, a very friendly front end. Rendering a full book still
> takes a bit of time, however mostly because book authors nowadays put so
> many tricks into LaTeX that it sometimes requires 6 or 7 runs to get it
> all sorted out ;-).
>

"Computer Modern Typefaces" was the book. If you don't have a copy, it's
worth paying for just to marvel at the attention to detail, quite apart
from being interesting in its own right...

Regards,

Chris
From: "Andy "Krazy" Glew" on

>> I believe you could indeed make a
>> 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>> dedicated hw, but you would have to make sure that _nobody_ except the
>> compiler writers ever needed to be exposed to it.

Trouble is, you need about 3x the instruction fetch/decode/scheduling
bandwidth. Since that is comparable to the actual instruction execution
in terms of power, depending on your machine, it is by no means a clear win.

You would need to be working on a code that allowed nearly all of the FP
"primitive operations" to be optimized away for it to be a win on scalar
code.

On vector code, if the "FP primitive operations" are distributed over a
larger vector, then the amortization of instruction fetch overheads may win.

Anyway, this is nothing new. I investigated this with a mind to
exposing the primitives to the compiler in the P6 era. Trouble is, the
compiler had bigger fish to fry.
From: Brett Davis on
In article <JY6dnaLYurTZJDvXnZ2dnUVZ_tqdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
> Lets look at the impact of implementing software FP, augmented by the
> necessary HW support.
>
> 1. You give up 1/2 the registers. Typically FP instructions (implicitly)
> use a different register set. This increases the number of names
> available to a compiler, while reducing the orthogonality in the ISA.
> What would you rather have: 32 registers that can be used by both FP and
> integer ops, or 32 registers that can be used by FP and 32 registers
> that can be used by integer, with an additional cost to transfer between
> them?

For "best cost/performance" today I would put the FPU in the integer
registers.

The high end will have a wide vector processor that takes care of all
the heavy computing for integer and floating tasks. The sixteen integer
registers of AMD64 are plenty for a bunch of counters and pointers; the
32 registers of a RISC chip are silly/wasteful in this context.

The low end will benefit from having an FPU without the huge costs of
another register set and pipeline. (A separate FPU roughly doubles the
size of the design.) 32 registers may be useful here, since they are
shared with FPU ops, except that C/C++ will almost never use even 16...

On the really low end you will microcode the FPU ops and share the
single adder and multiplier. Actually you will likely share the
multiplier regardless; it is very expensive real estate.

As for the half software FPU idea, not a fan of it. Mostly because it
has a tiny niche between no FPU and the microcoded FPU. Not a big enough
market to pay for the hardware design, much less the compiler support.

Compared just to the huge costs of a separate FPU register set and
pipeline, yes, it would make sense for a low end design. It might also
make sense as a retrofit to an existing low end design. Though again, I
would rewrite the definition of the FPU ops to work in the integer
registers. The instruction set will be different from the existing high
end design with its separate FPU, but that is going to be true anyway
with the half-FPU ops you invent.

You of course need 64 bit registers if you want to support double
precision floats. Most embedded tasks are more than happy with single
precision, so even a 32 bit core would benefit.

FYI: The cost to move between FPU and integer registers can be a dozen
cycles or more, lots more. (Think separate 20 cycle pipes that share
data through the cache...)

Brett
From: nmm1 on
In article <4AB580FE.404(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>>> I believe you could indeed make a
>>> 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>>> dedicated hw, but you would have to make sure that _nobody_ except the
>>> compiler writers ever needed to be exposed to it.
>
>Trouble is, you need about 3x the instruction fetch/decode/scheduling
>bandwidth. Since that is comparable to the actual instruction execution
>in terms of power, depending on your machine, it is by no means a clear win.

Nobody claims that it is a clear win - certainly neither I nor Terje
would. My assertion is that it would be better, overall, NOT solely
for performance reasons - but no more than that.

And you wouldn't need three times the instruction throughput, except
for highly tuned HPC and benchmarketing. Few 'floating-point' codes
have more than about 10% of their instructions actually executing
floating-point operations. Remember that load and store don't count,
and I said that I would also have a 'direct' comparison operation,
too. When I last measured this (decades ago), it would have needed
very little more instruction throughput, and RISC codes have more
integer operations than the ones I looked at.

>You would need to be working on a code that allowed nearly all of the FP
>"primitive operations" to be optimized away for it to be a win on scalar
>code.

Not so. That would be true for a very few codes, but others would
gain with little or no optimisation. For example, some codes spend
half their time switching between the pipelines (yes, really), and
others are dominated by calls to mathematical functions. By merging
the pipelines, the overheads for the latter could be reduced very
considerably.

Now, working out the winners and losers, and by how much, would be
part of the research project that this proposal would involve.
Nobody is saying that it could be done by waving a magic wand.

>Anyway, this is nothing new. I investigated this with a mind to
>exposing the primitives to the compiler in the P6 era. Trouble is, the
>compiler had bigger fish to fry.

Yup. I never said that it was new - it predates my involvement in
computing, and the reason you give is the reason it has never been
restarted.


Regards,
Nick Maclaren.