From: "Andy "Krazy" Glew" on
Terje Mathisen wrote:
> It seems to be a bandwidth problem much more than a flops, i.e. using
> the texture units effectively was the key to the big wins.

I am puzzled, torn, wondering about the texture units.

There's texture memory, which is just a slightly funky form of
cache/memory, with funky locality, and possible compression.

But then there's also the texture computation capability. Which is just
a funky form of 2D or 3D interpolation.

Most people seem to be getting the benefit from texture memory. But
when people use the texture interpolation compute capabilities, there's
another kicker.

Back in the 1990s on P6, when I was trying to make the case for CPUs to
own the graphics market, and not surrender it to the only-just-nascent
GPUs, the texture units were the oinker: they are just so damned
necessary to graphics, and they are just so damned idiosyncratic. I do
not know of any good way to do texturing in software that doesn't lose
performance, or of any good way to decompose texturing into simpler
instruction set primitives that could reasonably be added to an
instruction set. E.g. I don't know of any good way to express texture
operations in terms of 2, or 3, or even 4, register inputs.

Let's try again: how about an interpolate instruction that takes 3
vector registers, and performs interpolation between tuples in the X
direction along the length of the register, and in the Y direction
between corresponding elements in different registers?

But do you want 2, or 3, or 4, or ... arguments to interpolate along?
And what about Z interpolation?
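To make the operand-count problem concrete, here is a minimal sketch of what even the simplest case, bilinear filtering, has to consume per result. The function names and layout are hypothetical, not any real ISA or hardware.

```python
# A toy sketch of bilinear texture filtering decomposed into lerps.
# Names and structure are illustrative, not any real instruction set.

def lerp(a, b, t):
    """Linear interpolation: a + t*(b - a)."""
    return a + t * (b - a)

def bilinear(texels, fx, fy):
    """Filter a 2x2 quad of texels.

    texels = [[t00, t01], [t10, t11]] -- the 2x2 neighborhood
    fx, fy -- fractional coordinates in [0, 1)
    """
    top = lerp(texels[0][0], texels[0][1], fx)     # X direction, row 0
    bottom = lerp(texels[1][0], texels[1][1], fx)  # X direction, row 1
    return lerp(top, bottom, fy)                   # Y direction

print(bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5))  # -> 1.5
```

Even this simplest case consumes a 2x2 quad of texels plus two fractional weights per result, before mipmapping (which adds a second quad and a third weight) or anisotropy enters the picture. That is the register-input pressure the 2-, 3-, or 4-operand question is up against.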

Let alone compression? And skewed sampling? And ...

Textures just seem to be this big mass of stuff, all of which has to be
done in order to be credible.

Although I usually try to decompose complex things into simpler
operations, sometimes it is necessary to go the other way. Maybe we can
make the texture units more general. Make them into generally useful
function interpolation units. Add that capability to general purpose CPUs.

How much of the benefit is texture computation vs texture memory? Can
we separate these two things?

Texture computation is interpolation. (Which, of course, often
translates to memory savings because it changes the amount of memory you
need for lookup tables - higher order interpolation, or multiscale
interpolation => less memory traffic.) It looks like this can be made
general purpose. But how many people need it?
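A small sketch of the memory-savings point: linear interpolation lets a small lookup table stand in for a much larger one. The table sizes and the target function here are purely illustrative.

```python
import math

# Illustrative: a 257-entry table plus lerp approximates sin(2*pi*x).
N = 256
TABLE = [math.sin(2 * math.pi * i / N) for i in range(N + 1)]  # +1 guard entry

def sin_interp(x):
    """Approximate sin(2*pi*x) for x in [0, 1) via table lookup + lerp."""
    pos = x * N
    i = int(pos)
    frac = pos - i
    a, b = TABLE[i], TABLE[i + 1]
    return a + frac * (b - a)

err = max(abs(sin_interp(k / 4096) - math.sin(2 * math.pi * k / 4096))
          for k in range(4096))
# Worst-case error is roughly 7.5e-5; a nearest-entry table with similar
# accuracy would need tens of thousands of entries. Higher-order
# interpolation shrinks the table (and the memory traffic) further.
```

That is the trade the texture unit makes in hardware: a little interpolation arithmetic buys a large reduction in table size and memory traffic.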

Texture memory is ... a funky sort of cache, with compression. Caches we
can make generically useful. Compression - for read-only data
structures, sure. But how can we write INTO the "compressed texture
cache memory", in such a way that we don't blow out the compression when
it gets kicked out of the cache?
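A toy illustration of why writing into compressed texture memory is hard. This block format (min/max endpoints plus a 2-bit index per texel, loosely in the spirit of BC1-style formats, here on grayscale values) is lossy, so a read-modify-write of one texel can perturb the whole block on recompression.

```python
# Toy endpoint-based block compression: store (lo, hi) plus a 2-bit
# index per texel selecting one of 4 evenly spaced values.

def compress(block):
    lo, hi = min(block), max(block)
    if hi == lo:
        return lo, hi, [0] * len(block)
    idx = [round(3 * (v - lo) / (hi - lo)) for v in block]
    return lo, hi, idx

def decompress(lo, hi, idx):
    return [lo + i * (hi - lo) / 3 for i in idx]

block = [10, 20, 200, 250]
print(decompress(*compress(block)))  # -> [10.0, 10.0, 170.0, 250.0]
```

The round trip is already lossy (20 collapses onto 10). Writing a single texel means decompress, modify, recompress; if the new value moves the endpoints, every other texel in the block gets requantized too. That is why the compression works for read-only, preprocessed data but is awkward for writable cache lines.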

Or, can we safely create a hardware datastructure that is mainly useful
for caching read-only, heavily preprocessed data?

It seems to me that most of the GPGPU codes are not using the compute or
compression aspects of texture units. Indeed, CUDA doesn't really give
access to that. So it is probably just the extra memory ports and cache
behavior.

--

Terje, you're the master of lookup tables. Can you see a way to make
texture units generally useful?



From: Andrew Reilly on
On Fri, 11 Dec 2009 07:24:36 -0800, Andy "Krazy" Glew wrote:

> Back in the 1990s on P6, when I was trying to make the case for CPUs to
> own the graphics market, and not surrender it to the only-just-nascent
> GPUs, the texture units were the oinker: they are just so damned
> necessary to graphics, and they are just so damned idiosyncratic. I do
> not know of any good way to do texturing in software that doesn't lose
> performance, or of any good way to decompose texturing into simpler
> instruction set primitives that could reasonably be added to an
> instruction set. E.g. I don't know of any good way to express texture
> operations in terms of 2, or 3, or even 4, register inputs.

Isn't that a fairly damning argument against Larabee, as a general-
purpose graphics part? Or did Larabee have equivalent texture units
bolted on to the side of their Atom-ish cores?

Cheers,

--
Andrew
From: "Andy "Krazy" Glew" on
Andrew Reilly wrote:
> On Fri, 11 Dec 2009 07:24:36 -0800, Andy "Krazy" Glew wrote:
>
>> Back in the 1990s on P6, when I was trying to make the case for CPUs to
>> own the graphics market, and not surrender it to the only-just-nascent
>> GPUs, the texture units were the oinker: they are just so damned
>> necessary to graphics, and they are just so damned idiosyncratic. I do
>> not know of any good way to do texturing in software that doesn't lose
>> performance, or of any good way to decompose texturing into simpler
>> instruction set primitives that could reasonably be added to an
>> instruction set. E.g. I don't know of any good way to express texture
>> operations in terms of 2, or 3, or even 4, register inputs.
>
> Isn't that a fairly damning argument against Larabee, as a general-
> purpose graphics part? Or did Larabee have equivalent texture units
> bolted on to the side of their Atom-ish cores?

Where did you get your information about Larrabee?

Wikipedia (http://en.wikipedia.org/wiki/Larrabee_%28GPU%29) says
(as of the time I am posting this):

Larrabee's x86 cores will be based on the much simpler Pentium P54C design

Larrabee includes one major fixed-function graphics hardware feature:
texture sampling units. These perform trilinear and anisotropic
filtering and texture decompression.

The following seems to be the standard reference for Larrabee:

http://software.intel.com/file/2824/

Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey,
P., Junkins, S., Lake, A., Sugerman,
J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. 2008.
Larrabee: A Many–Core x86
Architecture for Visual Computing. ACM Trans. Graph. 27, 3, Article 18
(August 2008), 15 pages. DOI =
10.1145/1360612.1360617 http://doi.acm.org/10.1145/1360612.1360617.

I like their quote on texture units:

Larrabee includes texture filter logic because this operation
cannot be efficiently performed in software on the cores. Our
analysis shows that software texture filtering on our cores would
take 12x to 40x longer than our fixed function logic, depending on
whether decompression is required. There are four basic reasons:
• Texture filtering still most commonly uses 8-bit color
components, which can be filtered more efficiently in
dedicated logic than in the 32-bit wide VPU lanes.
• Efficiently selecting unaligned 2x2 quads to filter requires a
specialized kind of pipelined gather logic.
• Loading texture data into the VPU for filtering requires an
impractical amount of register file bandwidth.
• On-the-fly texture decompression is dramatically more
efficient in dedicated hardware than in CPU code.
The Larrabee texture filter logic is internally quite similar to
typical GPU texture logic. It provides 32KB of texture cache per
core and supports all the usual operations, such as DirectX 10
compressed texture formats, mipmapping, anisotropic filtering,
etc. Cores pass commands to the texture units through the L2
cache and receive results the same way. The texture units perform
virtual to physical page translation and report any page misses to
the core, which retries the texture filter command after the page is
in memory. Larrabee can also perform texture operations directly
on the cores when the performance is fast enough in software.
From: Andrew Reilly on
On Fri, 11 Dec 2009 22:02:51 -0800, Andy "Krazy" Glew wrote:

> Where did you get your information about Larrabee?

Only here. I don't recall it coming up, before. I'm not all that
interested in specialized graphics pipelines. Thanks for the great quote!

Cheers,

--
Andrew
From: "Andy "Krazy" Glew" on
Andrew Reilly wrote:
> On Fri, 11 Dec 2009 22:02:51 -0800, Andy "Krazy" Glew wrote:
>
>> Where did you get your information about Larrabee?
>
> Only here. I don't recall it coming up, before. I'm not all that
> interested in specialized graphics pipelines. Thanks for the great quote!

I guess that part of the reason for this conversation is...

Although I *am* interested in specialized graphics functions

I am much more interested in operations that are of general use.

If you think of texture units as generalized interpolation plus a cache
with compression, then we can think of areas of more general use.