From: Terje Mathisen on
Ken Hagan wrote:
> On Thu, 10 Sep 2009 11:16:30 +0100, Terje Mathisen
> <Terje.Mathisen(a)tmsw.no> wrote:
>
>> Another place where fp really helps is when doing gamma-corrected
>> blending/sampling.
>
> How? I'm surprised because I'd have thought a pre-computed LUT was far
> and away the fastest solution for 8-bit images (12-bit once linearised),
> despite its cache-busting tendencies. (That said, since you haven't
> spent a zillion transistors on hardware FP, perhaps your cache is
> larger. :)

You might well be right, in that we already have 12-bit accurate parallel
invsqrt() and reciprocal() lookup functions.

The gamma stuff needs to be 10-bit accurate, AFAIR from the DX9/10 specs.
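
For concreteness, the LUT route you describe might look like this in C;
the plain 2.2 power law and the table names are just placeholders for
whatever transfer function the spec actually mandates:

#include <math.h>
#include <stdint.h>

static uint16_t gamma_to_linear[256];   /* 8-bit sample -> 12-bit linear */
static uint8_t  linear_to_gamma[4096];  /* 12-bit linear -> 8-bit sample */

static void init_tables(void)           /* assumes a plain 2.2 power law */
{
    for (int i = 0; i < 256; i++)
        gamma_to_linear[i] =
            (uint16_t)(4095.0 * pow(i / 255.0, 2.2) + 0.5);
    for (int i = 0; i < 4096; i++)
        linear_to_gamma[i] =
            (uint8_t)(255.0 * pow(i / 4095.0, 1.0 / 2.2) + 0.5);
}

/* Blend two gamma-encoded samples in linear space; alpha is 0..256. */
static uint8_t blend(uint8_t a, uint8_t b, unsigned alpha)
{
    uint32_t l = (gamma_to_linear[a] * (256 - alpha)
                + gamma_to_linear[b] * alpha) >> 8;
    return linear_to_gamma[l];          /* l <= 4095 by construction */
}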

I do remember that for trilinear sampling the log2() function needs
to have at least 10 fractional bits correct; I was able to
discover/invent fast SIMD-style code that managed this without any
lookup tables.

I would not have been able to do the same without fp hardware!
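
For flavor, here is a scalar C sketch of one well-known table-free
scheme along those lines (not the actual code): pull the exponent out
of the IEEE-754 bit pattern with integer ops, force the mantissa into
[1,2), and evaluate a short polynomial. Every step is branch-free, so
it maps directly onto SIMD lanes. The quadratic below is only good to
5-6 bits; hitting 10 fractional bits needs a longer minimax fit:

#include <stdint.h>
#include <string.h>

static float log2_approx(float x)       /* assumes x > 0 and finite */
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);                    /* bit pattern of x */
    float e = (float)((int32_t)(u >> 23) - 127); /* unbiased exponent */
    u = (u & 0x007FFFFFu) | 0x3F800000u;  /* mantissa forced into [1,2) */
    float m;
    memcpy(&m, &u, sizeof m);
    /* quadratic interpolating log2(m) at m = 1, 1.5, 2 */
    return e + ((-0.33985f * m + 2.01955f) * m - 1.67970f);
}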

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <376dneGNbOn-ZDXXnZ2dnUVZ8jGdnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>>> I believe you could indeed make a
>>> 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>>> dedicated hw, but you would have to make sure that _nobody_ except the
>>> compiler writers ever needed to be exposed to it.
>>
>> Nah. You are forgetting yourself, me and other authors of the
>> intrinsic numerical functions and auxiliary code (e.g. complex
>> division) :-) But, with that niggle, I agree.
>
>The math library writers are part of the compiler team, particularly
>since you would want many (most?) of these functions to be intrinsic to
>the compiler, so that it could unroll & interleave as many of them as
>would fit in the register set.

Er, no, they're not! Not in practice, and not entirely in theory. At
most the compiler team will take responsibility for the most important
of the standard intrinsics.

'Tain't true for gcc and glibc, for example.

>I.e. not just a callable math library, but I assume this has been part
>of your suggestion from the start?

Yes. It's standard compiler technology, so I was assuming that it
would be done by the teams that are most concerned about performance.


Regards,
Nick Maclaren.
From: nmm1 on
In article <wOmdnaYJINxgZzXXnZ2dnUVZ8jidnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>I do remember that for trilinear sampling the log2() function needs
>to have at least 10 fractional bits correct; I was able to
>discover/invent fast SIMD-style code that managed this without any
>lookup tables.

No floating-point hardware is needed for implementing fast, efficient
logarithm functions - indeed, most implementations do most of their
work in fixed point!

number = mantissa * 2^exponent
logarithm = exponent + polynomial(mantissa)

Working out a suitable polynomial is easy, and the scaling can be (and
usually is) done statically. There really isn't a problem.

As someone posted, you can vary the precision of the multiplications,
since the higher-order terms need less accuracy. In fact, you can do
an initial linear approximation using only shifting and addition, to
ensure that all the remaining multiplications use only a few bits.

number = mantissa * 2^exponent
logarithm = exponent + constant + k*mantissa + polynomial(mantissa)

where k is a convenient number like 5/8.
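
A toy Q16.16 sketch of the basic split, to make the structure concrete
(the quadratic gives only about 6 good bits; a real version would add
terms, or the linear pre-reduction above, to reach 10):

#include <stdint.h>

int32_t log2_fixed(uint32_t x)    /* x > 0 in Q16.16; result Q16.16 */
{
    int32_t e = 0;
    while (x <  0x10000) { x <<= 1; e--; }   /* normalize the mantissa */
    while (x >= 0x20000) { x >>= 1; e++; }   /* into [1.0, 2.0)        */

    /* p(m) = -0.33985*m^2 + 2.01955*m - 1.67970, constants in Q16.16,
       fitted to log2(m) at m = 1, 1.5 and 2 */
    int32_t m  = (int32_t)x;
    int32_t m2 = (int32_t)(((int64_t)m * m) >> 16);
    int32_t p  = (int32_t)(((int64_t)-22274 * m2
                          + (int64_t)132353 * m) >> 16) - 110081;

    return e * 65536 + p;                    /* exponent + polynomial */
}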



Regards,
Nick Maclaren.
From: Terje Mathisen on
nmm1(a)cam.ac.uk wrote:
> In article <376dneGNbOn-ZDXXnZ2dnUVZ8jGdnZ2d(a)lyse.net>,
> Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:

>> The math library writers are part of the compiler team, particularly
>> since you would want many (most?) of these functions to be intrinsic to
>> the compiler, so that it could unroll & interleave as many of them as
>> would fit in the register set.
>
> Er, no, they're not! Not in practice, and not entirely in theory. At

Sorry, I meant that the math code needs to be owned by the compiler
team, for _this_ particular architecture.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <oeSdnTfEupOWyzTXnZ2dnUVZ8i5i4p2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>nmm1(a)cam.ac.uk wrote:
>> In article <376dneGNbOn-ZDXXnZ2dnUVZ8jGdnZ2d(a)lyse.net>,
>> Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>>> The math library writers are part of the compiler team, particularly
>>> since you would want many (most?) of these functions to be intrinsic to
>>> the compiler, so that it could unroll & interleave as many of them as
>>> would fit in the register set.
>>
>> Er, no, they're not! Not in practice, and not entirely in theory. At
>
>Sorry, I meant that the math code needs to be owned by the compiler
>team, for _this_ particular architecture.

Oh, right. Yes. I misunderstood you.

But my point stands: the facilities would be useful to other people
as well. My experience of writing numeric functions is that a good
half of my problems arose from the fact that I couldn't get at
operations that I knew perfectly well were there in the hardware.

That applies to performance, accuracy and exception handling alike.


Regards,
Nick Maclaren.