From: hanukas on
On Sep 10, 1:16 pm, Terje Mathisen <Terje.Mathi...(a)tmsw.no> wrote:
> > in floating-point because so few people understand fixed-point
> > numerics nowadays.
>
> All the texture access&filtering, which is responsible for the majority
> of all the flops, is defined using texture quads surrounding the fp
> coordinates of the sampling point.

Just a small clarification: texture samplers definitely use
fixed-point when they can get away with it, because smaller precision is
faster. Usually you need a float texture format before the (slower)
float sampler path is triggered.



> Using this approach avoids the need for a division per pixel (or small
> group of pixels, as in the original Quake), so it is a real win as long
> as you have enough hardware to handle all four coordinates simultaneously.

The hardware samplers don't really work the way Quake's scanline
dividing innerloop does. It's very common to interpolate barycentric
coordinates divided by vertex w coordinate. Then only one division per
fragment is needed for resolving perspective for all varyings.
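
The scheme above can be sketched roughly as follows. This is a minimal illustration, not any real hardware or API; the struct and function names are made up. Each vertex carries its varying pre-divided by w together with 1/w, both of which interpolate linearly in screen space, so one division per fragment recovers the perspective-correct value:

```c
#include <assert.h>
#include <math.h>

/* Illustrative sketch: interpolate attr/w and 1/w with barycentric
   weights, then one division resolves perspective. */
typedef struct { float attr_over_w, one_over_w; } PreDivVert;

static float interp_persp(const PreDivVert v[3],
                          float b0, float b1, float b2) {
    float a = b0 * v[0].attr_over_w + b1 * v[1].attr_over_w + b2 * v[2].attr_over_w;
    float w = b0 * v[0].one_over_w  + b1 * v[1].one_over_w  + b2 * v[2].one_over_w;
    return a / w;   /* the single per-fragment division */
}
```

With several varyings you would of course compute the reciprocal of the interpolated 1/w once and multiply it into each varying, which is what makes one division serve all of them.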

It's actually only half a division per fragment, if we take advantage
of common divisors:

Let's say we are doing a/b and c/d; we can combine the divisors: x =
1.0 / (b * d)

Now we have:

a * d * x and c * b * x
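
In code, the shared-reciprocal trick is just this (function name is illustrative): with x = 1/(b*d), multiplying a by d cancels d from the denominator and leaves a/b, and symmetrically for c/d:

```c
#include <assert.h>
#include <math.h>

/* One reciprocal yields both quotients:
   a/b == a*d/(b*d) and c/d == c*b/(b*d). */
static void two_divs_one_recip(float a, float b, float c, float d,
                               float *ab, float *cd) {
    float x = 1.0f / (b * d);   /* the only division */
    *ab = a * d * x;            /* == a/b */
    *cd = c * b * x;            /* == c/d */
}
```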

Half a division per pixel. We need more multiplications, though. One
thing to observe here is that:

c = a + constant;
b = d + constant;

This is an advantage when you process fragments in blocks. A 2x2 block
size, or some multiple of it, has been very common since the D3D and OGL
specifications. When computing the mip level for a texture sample, the
derivatives across the 2x2 block are compared, something like this:

dx = x0 - x1;
dy = y0 - y1;
d = max(dx, dy);
miplevel = log2(d);

Then the fragments that fall outside the primitive are simply
discarded. Not the most efficient use of hw resources for small
primitives. This can be worked around to a degree, but the granularity
of fragment computation is determined by this specification. =( At any
rate, this plays well with the two-divisions-for-the-price-of-one scheme,
which just saves a few gates: not that big a factor, as the number of
samplers is usually limited to a fixed cap.
From: nmm1 on
In article <9_ydnas969ZiTDXXnZ2dnUVZ8qOdnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>> Yes, but remember that those don't need floating-point, in the first
>> place! Almost all GPU algorithms are fixed-point ones, implemented
>
>Not any more. GPUs switched to using fp internally _before_ they ever
>exposed this to the end users/developers, simply because it made sense
>for the GPU vendors, and they already had working fixed-point firmware
>for previous generations of cards.

Eh? The reason they switched was NOT because the algorithms weren't
fixed-point ones, but because their new 'computer science' employees
didn't have a clue about scaling. Few people under 70 do :-(

>> in floating-point because so few people understand fixed-point
>> numerics nowadays.
>
>All the texture access&filtering, which is responsible for the majority
>of all the flops, is defined using texture quads surrounding the fp
>coordinates of the sampling point.

Sorry, Terje, but that's a fixed-point algorithm, not a floating-point
one. You DO realise that I don't mean 'integer' by 'fixed-point',
don't you?

>Using this approach avoids the need for a division per pixel (or small
>group of pixels, as in the original Quake), so it is a real win as long
>as you have enough hardware to handle all four coordinates simultaneously.

Well, yes, but would you like to show me some code or pseudo-code
that can be implemented using floating-point but not fixed-point?

>Yes, this _could_ have been handled with fixed-point, but since you need
>10 bits of fractional precision, you'd still end up with 32-bit chunks,
>and you'd have to be very careful to avoid overflows.

There's a problem with that? SOP, as far as I am concerned.
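
As a concrete illustration of the kind of fixed-point filtering being discussed, here is a minimal sketch of a bilinear blend of 8-bit texel channels using 10-bit fractional weights, entirely in 32-bit integer arithmetic. The names and the choice to round intermediates back to 8 bits are mine; real pipelines keep extra headroom between the lerps:

```c
#include <assert.h>
#include <stdint.h>

/* 10 fractional weight bits; 255 * 1024 fits comfortably in 32 bits,
   so no overflow care is needed beyond picking these widths. */
enum { FRAC_BITS = 10, FX_ONE = 1 << FRAC_BITS };

static uint32_t lerp_fx(uint32_t a, uint32_t b, uint32_t f) {
    /* rounded fixed-point lerp: a + f*(b-a), kept unsigned */
    return (a * (FX_ONE - f) + b * f + FX_ONE / 2) >> FRAC_BITS;
}

static uint32_t bilerp_fx(uint32_t t00, uint32_t t10,
                          uint32_t t01, uint32_t t11,
                          uint32_t fu, uint32_t fv) {
    return lerp_fx(lerp_fx(t00, t10, fu), lerp_fx(t01, t11, fu), fv);
}
```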

>Another place where fp really helps is when doing gamma-corrected
>blending/sampling. I'm guessing anisotropic filtering is in the same group.

I am not familiar in detail with gamma correction or anisotropic
filtering, but I think that they fall into the (large) class of
algorithms where floating-point helps by making it possible for
people who don't have a clue about scaling to write more-or-less
working code. What was I saying about 'computer scientists'?

>> Realistically, it's that aspect that kills my idea, not the actual
>> architectural ones. It doesn't matter how good the engineering is
>> if the customers won't adopt it.
>
>I believe you could indeed make a
>'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>dedicated hw, but you would have to make sure that _nobody_ except the
>compiler writers ever needed to be exposed to it.

Nah. You are forgetting yourself, me and other authors of the
intrinsic numerical functions and auxiliary code (e.g. complex
division) :-) But, with that niggle, I agree.


Regards,
Nick Maclaren.
From: Ken Hagan on
On Thu, 10 Sep 2009 11:16:30 +0100, Terje Mathisen
<Terje.Mathisen(a)tmsw.no> wrote:

> Another place where fp really helps is when doing gamma-corrected
> blending/sampling.

How? I'm surprised because I'd have thought a pre-computed LUT was far and
away the fastest solution for 8-bit images (12-bit once linearised),
despite its cache-busting tendencies. (That said, since you haven't spent
a zillion transistors on hardware FP, perhaps your cache is larger. :)
From: Terje Mathisen on
hanukas wrote:
> On Sep 10, 1:16 pm, Terje Mathisen<Terje.Mathi...(a)tmsw.no> wrote:
>>> in floating-point because so few people understand fixed-point
>>> numerics nowadays.
>>
>> All the texture access&filtering, which is responsible for the majority
>> of all the flops, is defined using texture quads surrounding the fp
>> coordinates of the sampling point.
>
> Just a small clarification: texture samplers definitely use
> fixed-point when they can get away with it, because smaller precision is
> faster. Usually you need a float texture format before the (slower)
> float sampler path is triggered.

Thanks!

My exposure to this field has been in the form of writing some code for
parts of what was going to become a full DX9 sw rasterizer.

>> Using this approach avoids the need for a division per pixel (or small
>> group of pixels, as in the original Quake), so it is a real win as long
>> as you have enough hardware to handle all four coordinates simultaneously.
>
> The hardware samplers don't really work the way Quake's scanline
> dividing innerloop does. It's very common to interpolate barycentric
> coordinates divided by vertex w coordinate. Then only one division per
> fragment is needed for resolving perspective for all varyings.
>
> It's actually only half a division per fragment, if we take advantage
> of common divisors:
>
> Let's say we are doing a/b and c/d; we can combine the divisors: x =
> 1.0 / (b * d)
>
> Now we have:
>
> a * d * x and c * b * x
>
> Half a division per pixel. We need more multiplications, though. One
> thing to observe here is that:
>
> c = a + constant;
> b = d + constant;
>
> This is an advantage when you process fragments in blocks. A 2x2 block
> size, or some multiple of it, has been very common since the D3D and OGL
> specifications. When computing the mip level for a texture sample, the
> derivatives across the 2x2 block are compared, something like this:
>
> dx = x0 - x1;
> dy = y0 - y1;
> d = max(dx, dy);
> miplevel = log2(d);

Yeah, this looks similar to the code I remember from a couple of years ago.
>
> Then the fragments that fall outside the primitive are simply
> discarded. Not the most efficient use of hw resources for small
> primitives. This can be worked around to a degree, but the granularity
> of fragment computation is determined by this specification. =( At any
> rate, this plays well with the two-divisions-for-the-price-of-one scheme,
> which just saves a few gates: not that big a factor, as the number of
> samplers is usually limited to a fixed cap.

OK.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on
nmm1(a)cam.ac.uk wrote:
>> I believe you could indeed make a
>> 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>> dedicated hw, but you would have to make sure that _nobody_ except the
>> compiler writers ever needed to be exposed to it.
>
> Nah. You are forgetting yourself, me and other authors of the
> intrinsic numerical functions and auxiliary code (e.g. complex
> division) :-) But, with that niggle, I agree.

The math library writers are part of the compiler team, particularly
since you would want many (most?) of these functions to be intrinsic to
the compiler, so that it could unroll & interleave as many of them as
would fit in the register set.

I.e. not just a callable math library, but I assume this has been part
of your suggestion from the start?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"