From: "Andy "Krazy" Glew" on
We (comp.arch) have discussed how I perceive the "SIMD" GPU architectures to be truly SIMT, the term Nvidia coined, or
"coherent threaded" (CT) or NIMT in my terminology.

Now let's talk about the microarchitectures of the shader processors.

Nvidia's shaders are apparently strictly scalar.

Intel's seem to be vector.

AMD's are currently VLIW. Apparently AMD evolved through a vector stage, with a 4x32-bit wide vector unit plus a single
math pipe, and then progressed to independent control for each of the 5 pipes, hence VLIW.

AMD's SIMD engines seem to be 16 VLIWs wide, at 5 ALUs per VLIW, for 80 ALUs in total.

Nvidia Fermi's SIMDs are 16 wide (with 2 such per streaming multiprocessor).

Now, the overhead in terms of instruction bits per ALU is roughly the same: a 5-op VLIW instruction controlling 80 ALUs
works out to about the same bits per ALU as a 1-op scalar instruction controlling 16. Nvidia is probably a little bit
ahead, because it doesn't need to have quite so many reduction instructions.

However, the overhead in terms of sequencing logic per ALU is in AMD's VLIW's favor. 1 sequencer per 80 ALUs, vs. 1 per
16. Now, the VLIW sequencer is more complex than the scalar sequencer, but probably not 5X.
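
A back-of-envelope sketch of that arithmetic, in Python. The 32-bit op encoding is a placeholder of mine, not either
vendor's real format; only the ratios matter:

# Control overhead per ALU under an assumed encoding width.
OP_BITS = 32

# AMD-style: one 5-op VLIW instruction sequenced across 16 lanes = 80 ALUs.
amd_alus = 16 * 5
amd_bits_per_alu = (5 * OP_BITS) / amd_alus   # 160/80 = 2.0
amd_seq_per_alu = 1 / amd_alus                # 1 sequencer per 80 ALUs

# Nvidia-style: one scalar instruction sequenced across a 16-wide SIMD.
nv_alus = 16
nv_bits_per_alu = OP_BITS / nv_alus           # 32/16 = 2.0
nv_seq_per_alu = 1 / nv_alus                  # 1 sequencer per 16 ALUs

print(amd_bits_per_alu, nv_bits_per_alu)      # instruction bits/ALU: a wash
print(amd_seq_per_alu, nv_seq_per_alu)        # sequencers/ALU: 5x fewer for VLIW
# Even if the VLIW sequencer is, say, 3x a scalar one, 3/80 < 1/16.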

However, Nvidia's scalar SIMT will undoubtedly have less fragmentation in the instruction stream. You just can't always
fill a 5-wide VLIW.

Why couldn't Nvidia have ganged up 64 or 128 ALUs in a SIMD engine? Well, they could have - but that would have
suffered SIMD fragmentation.
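
A toy model makes the width effect concrete; the independent per-lane branch probability is an assumption of mine, not
a measurement of real shader code:

import random

# Each lane independently takes a two-way branch with probability p. A
# coherent group issues one path; a divergent group issues both, with
# lanes masked off. Deeper branch nesting only makes this worse.
def utilization(width, p, trials=100_000):
    issued = 0
    for _ in range(trials):
        taken = sum(random.random() < p for _ in range(width))
        issued += width if taken in (0, width) else 2 * width
    return (trials * width) / issued    # useful lane-ops / issued lane-ops

for width in (16, 64, 128):
    print(width, round(utilization(width, p=0.1), 3))
# Roughly 0.55 at 16 wide, falling to ~0.50 by 64 wide and beyond.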

Probably more importantly, this wider SIMD would have led to inefficient use of ALUs wrt memory. If the tiles or
whatever units of memory they are operating on are not fully populated, then SIMD lanes would have to be disabled.

In many ways, memory - the tile width, etc. - probably does more to determine the datapath width, in terms of the
number of ALUs per SIMD engine, than anything else. I don't think it is a random coincidence that both vendors ended up
at about 16-wide SIMD, approximately a 4x4 tile.
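
A minimal sketch of how I read that coupling, assuming one SIMD lane per pixel of a square screen tile, with a
2-column sliver standing in for a triangle edge that clips the tile:

# The same sliver of geometry disables a larger fraction of the lanes
# as the tile (and hence the SIMD width) grows.
for side in (4, 8):                # 4x4 -> 16 wide, 8x8 -> 64 wide
    width = side * side
    enabled = side * 2             # the sliver covers 2 columns either way
    print(f"{side}x{side} tile: {enabled}/{width} lanes enabled "
          f"({100 * enabled // width}%)")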

I wonder, if you were building a machine specifically for 3D volumetric rendering, whether 4x4x4 cubic tiles would be
more natural, leading to 64-wide SIMD?

---

I conjecture that something intermediate between 1-wide scalar SIMD and 5-wide VLIW SIMD may be appropriate. I
conjecture that 5-wide may be something of a historical accident, due to graphics' 4-wide vectors plus 1 ALU.

---

In earlier posts I have discussed my quandary wrt intra-cache-line SIMD vs. inter-cache-line SIMD. Should you encourage
threads to operate on all of the data in a cache line, as you might want in a MIMD system? One of the advantages of
SIMD is that neighboring SIMD ALUs can work on neighboring parts of the same cache line, increasing the ratio of ALU
transistors to memory transistors. Basically, more efficient.
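
A sketch of the two layouts, with line and element sizes that are assumptions of mine (64-byte lines, 4-byte
elements):

LINE_BYTES, ELEM_BYTES, LANES = 64, 4, 16

def lines_touched(addrs):
    return len({a // LINE_BYTES for a in addrs})

# Intra-cache-line SIMD: lane i works on word i of one contiguous block.
intra = [lane * ELEM_BYTES for lane in range(LANES)]

# Inter-cache-line, MIMD-friendly: each thread owns a whole line.
inter = [lane * LINE_BYTES for lane in range(LANES)]

print(lines_touched(intra))   # 1 line feeds all 16 ALUs
print(lines_touched(inter))   # 16 lines to feed the same 16 ALUs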

How does the choice of scalar-SIMD versus VLIW-SIMD (or SIMT) affect this?

Ostensibly, since VLIW-SIMD is already more efficient, in the sense that more transistors are spent on ALUs than on
sequencing, it may be more willing to give up some of that efficiency to allow MIMD compatibility. That is somewhat
counterintuitive: Nvidia, at least, seems to be the more oriented towards generality, and I think that encouraging
code to run transparently on MIMD as well as SIMT is more general.

---

In earlier posts I discussed how time vectors could improve SIMT / CT efficiency. That seems to be orthogonal to the
inter- vs. intra-cache-line SIMD question.