From: NV55 on
http://www.beyond3d.com/content/reviews/51

NVIDIA GT200 GPU and Architecture Analysis

Published on 16th Jun 2008, written by Rys for Consumer Graphics -
Last updated: 15th Jun 2008

Introduction

Sorry G80, your time is up.

There's no arguing that NVIDIA's flagship D3D10 GPU has held a reign
over 3D graphics that never truly saw it usurped, even by G92 and a
dubiously named GeForce 9-series range. The high-end launch product
based on G80, GeForce 8800 GTX, is still within spitting distance of
anything that's come out since in terms of raw single-chip
performance. It flaunts its 8 clusters, 384-bit memory bus and 24 ROPs
in the face of G92, meaning that products like 9800 GTX have never
really felt like true upgrades to owners of G80-based products.

That I type this text on my own PC powered by a GeForce 8800 GTX, one
that I bought -- which is largely unheard of in the world of tech
journalism; as a herd, we never usually buy PC components -- with my
own hard-earned, and on launch day no less, speaks wonders for the
chip's longevity. I'll miss you old girl, your 20 month spell at the
top of the pile is now honestly up. So what chip is the usurper, and
how far has it moved the game on?

Rumours about GT200 have swirled for some time, and recently the
rumour mill has mostly got it right. The basic architecture is pretty
much a known quantity at this point, and it's a basic architecture
that shares a lot of common ground with the one powering the chip
we've just eulogised. Why mess too much with what's worked so well,
surely? "Correctamundo", says the Fonz, and the Fonz is always right.

It's all about the detail now, so we'll try and reveal as much as
possible to see where the deviance can be found. We'll delve into the
architecture first, before taking a look at the first two products it
powers, looking back to previous NVIDIA D3D10 hardware as necessary to
paint the picture.


NVIDIA GT200 Overview

The following diagram represents a high-level look at how GT200 is
architected and what some of the functional units are capable of. It's
a similar chip to G80, of that there's no doubt, but the silicon
surgery undertaken by NVIDIA's architects to create it means we have
quite a different beast when you take a look under the surface.


http://www.beyond3d.com/images/reviews/gt200-arch/GT200-full-1.2-26-05-08.png


If it's not clear from the above diagram, like G80, GT200 is a fully-
unified, heavily-threaded, self load-balancing (full time, agnostic of
API) shading architecture. It has decoupled and threaded data
processing, allowing the hardware to fully realise the goal of hiding
sampler latency by scheduling sampler threads independently of, and
asynchronously with, shading threads.

The design goals of the chip appear to be the improvement of D3D10
performance in general, especially at the Geometry Shader stage, with
the end result presumably as close to doubling the performance of a
similarly clocked G92 as possible. There's not 2x the raw performance
available everywhere on the chip of course, but the increase in
certain computation resources should see it achieve something like
that in practice, depending on what's being rendered or computed.

Let's look closer at the chip architecture, then. The analysis was
written with our original look at G80 in mind. The architecture we
discussed there is the basis for what we'll talk about today, so have
a good read of that to refresh your memory, and/or ask in the forums
if anything doesn't make sense. The original piece is a little
outdated in places, as we've discovered more about the chip over the
last year and a half, so just ask about or let us know about anything
that doesn't quite fit.


GT200: The Shading Core

http://www.beyond3d.com/images/reviews/gt200-arch/shader-core.png

GT200 demonstrates subtle yet distinct architectural differences when
compared to G80, the chip that pioneered the basic traits of this
generation of GPUs from Kirk and Co. As we've alluded to, G80 led a
family of chips that have underpinned the company's dominance over AMD
in the graphics space since its launch, so it's no surprise to see
NVIDIA stick to the same themes of execution, use of on-chip memories,
and approach to acceleration of graphics and non-graphics computation.

At its core, GT200 is a MIMD array of SIMD processors, partitioned
into what we call clusters, with each cluster a 3-way collection of
shader processors which we call an SM. Each SM, or streaming
multiprocessor, comprises 8 scalar ALUs, with each capable of FP32 and
32-bit integer computation (the only exception being multiplication,
which is INT24 and therefore still takes 4 cycles for INT32), a single
64-bit ALU for brand new FP64 support, and a discrete pool of shared
memory 16KiB in size.
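
To put numbers on that hierarchy, here is a minimal Python sketch of
the unit counts. It assumes the full 10-cluster configuration described
later in this piece; the names used (CLUSTERS, shader_core_totals and
so on) are ours, purely for illustration, and not anything NVIDIA
exposes.

```python
# Unit counts for GT200's shading core, using figures quoted in the article.
CLUSTERS = 10               # assumption: the full configuration
SMS_PER_CLUSTER = 3
FP32_ALUS_PER_SM = 8
FP64_ALUS_PER_SM = 1
SHARED_MEM_PER_SM_KIB = 16


def shader_core_totals(clusters=CLUSTERS):
    """Multiply the per-SM resources out across the given number of clusters."""
    sms = clusters * SMS_PER_CLUSTER
    return {
        "sms": sms,
        "fp32_alus": sms * FP32_ALUS_PER_SM,
        "fp64_alus": sms * FP64_ALUS_PER_SM,
        "shared_mem_kib": sms * SHARED_MEM_PER_SM_KIB,
    }


if __name__ == "__main__":
    for name, value in shader_core_totals().items():
        print(f"{name}: {value}")
```

For the full chip that works out to 30 SMs, 240 FP32 ALUs, 30 FP64 ALUs
and 480KiB of shared memory in aggregate.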

The FP64 ALU is notable not just in its inclusion, NVIDIA supporting
64-bit computation for the first time in one of its graphics
processors, but in its ability. It's capable of a double precision MAD
(or MUL or ADD) per clock, supports 32-bit integer computation, and
somewhat surprisingly, signalling of a denorm at full speed with no
cycle penalty, something you won't see in any other DP processor
readily available (such as any x86 or Cell). The ALU uses the MAD to
accelerate software support for specials and divides, where possible.

Those ALUs are paired with another per-SM block of computation units,
just like G80, which provide scalar interpolation of attributes for
shading and a single FP-only MUL ALU. That lets each SM potentially
dual-issue 8 MAD+MUL instruction pairs per clock for general shading,
with the MUL also assisting in attribute setup when required.
However, as you'll see, that dual-issue performance depends heavily on
input operand bandwidth.

Each warp of threads still runs for four clocks per SM, with up to
1024 threads managed per SM by the scheduler (which has knock-on
effects for the programmer when thinking about thread blocks per
cluster). The hardware still scales back threads in flight if there's
register pressure of course, but that's going to happen less now the
RF has doubled in size per SM (and it might happen more gracefully now
to boot).
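
As a rough illustration of why the bigger register file matters, the
sketch below estimates how many threads an SM can keep in flight for a
given per-thread register count. It assumes a warp of 32 threads
(consistent with 8 ALUs running a warp over four clocks) and ignores
thread block granularity and shared memory limits, so treat it as a
simplified model rather than how the scheduler really allocates. The
function name and the half-size comparison are ours.

```python
WARP_SIZE = 32               # assumption: 8 ALUs x 4 clocks per warp
MAX_THREADS_PER_SM = 1024    # scheduler limit quoted in the article
REGISTERS_PER_SM = 16384     # 32-bit registers per SM in GT200


def threads_in_flight(regs_per_thread, registers_per_sm=REGISTERS_PER_SM):
    """Simplified estimate: whole warps only, register and thread caps only."""
    by_registers = registers_per_sm // regs_per_thread
    threads = min(MAX_THREADS_PER_SM, by_registers)
    return (threads // WARP_SIZE) * WARP_SIZE


if __name__ == "__main__":
    half_size = REGISTERS_PER_SM // 2   # a register file half the size, for comparison
    for regs in (10, 16, 32, 64):
        print(regs, threads_in_flight(regs), threads_in_flight(regs, half_size))
```

Under this model a shader needing 16 registers per thread still fills
all 1024 thread slots on GT200, where a register file half the size
would manage only 512.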

So, along with that pool of shared memory is connection to a per-SM
register file comprising 16384 32-bit registers, double that available
for each SM in G80. Each SP in each SM runs the same instruction per
clock as the others, but each SM in a cluster can run its own
instruction. Therefore in any given cycle, SMs in a cluster are
potentially executing a different instruction in a shader program in
SIMD fashion. That goes for the FP64 ALU per SM too, which could
execute at the same time as the FP32 units, but it shares datapaths to
the RF, shared memory pools, and scheduling hardware with them so the
two can't go full-on at the same time (presumably it takes the place
of the MUL/SFU, but perhaps it's more flexible than that). Either way,
it's not currently exposed outside of CUDA or used to boost FP32
performance.

That covers basic execution across a cluster using its own memory
pools. Across the shader core, each SM in each cluster is able to run
a different instruction for a shader program, giving each SM its own
program counter, scheduling resources, and discrete register file
block. A processing thread started on one cluster can never execute on
any other, although another thread can take its place every cycle. The
SM schedulers implement execution scoreboarding and are fed from the
global scheduler and per thread-type setup engines, one for VS, one
for GS and one for PS threads.
From: NV55 on

GT200: Sampling and the ROP

http://www.beyond3d.com/images/reviews/gt200-arch/tpc.png

For data fetch and filtering, each cluster is connected to its own
discrete sampler unit (with cluster + samplers called the texture
processing cluster or TPC by NVIDIA), with each one able to calculate
8 sample addresses and bilinearly filter 8 samples per clock. That's
unchanged compared to G92, but it's worth pointing out that prior
hardware could never reach the bilinear peak outside of (strangely
enough) scalar FP32 textures. It's now obtainable (or at least much
closer) thanks to, according to NVIDIA, tweaks to the thread scheduler
and sampler I/O. We still heavily suspect though that one of the key
reasons is additional shared INT16 hardware for what we imagine
actually is a shared addressing/filtering unit. Either way, each
sampler has a dedicated L1 cache which is likely 16KiB and all sampler
units share a global L2 cache that we believe is double the size of
that in G80 at 256KiB. The sampler hardware runs at the chip base
clock, whereas the shading units run at the chip hot clock, which is
most easily thought of as being 2x the scheduler clock. Along with the
memory clock, those mentioned clocks comprise the main domains in
GT200, just like they did in G80.

The hardware is advertised as supporting D3D10.0, since its
architecture is marginally incapable of supporting 10.1, by virtue of
the ROP hardware. D3D10 compliance means the ability in hardware for
recycling data from GS stage of the computation model back through the
chip for another pass. The output buffer for that is six times larger
in GT200 than in G80, although NVIDIA don't disclose the exact size.
Given that the GS stage is capable of data amplification (and de-
amplification of course), the increased buffer size represents a
significant change in what the architecture is capable of in a
performance sense, if not a theoretical sense. The same per-thread
output limits are present, but more GS threads can now be run at the
same time.

That covers the changes to on-chip memories that each cluster has
access to. Quickly returning to the front of the chip, it appears that
the hardware can still only setup a single triangle per clock, and the
rasteriser is largely unchanged. Remember that in G80, the rasteriser
worked on 32 pixel blocks, correlating to the pixel batch size. GT200
continues to work on the same size pixel blocks as it sends the screen
down through the clusters as screen tiles for shading.

http://www.beyond3d.com/images/reviews/g80-arch/g80-quad-rop.png

At the back of the chip, after computation via each TPC, the same
basic ROP architecture as G80 is present. With the external memory bus
512 bits wide this time and each 64-bit memory channel serving a ROP
partition, that means 8 ROP partitions, each partition housing a
quartet of ROP units. 32 in total then. Each ROP is now capable of a
full-speed INT8 or FP16 channel blend per cycle, whereas G80 needed
two cycles to complete the same operations. This guarantees that
blending isn't ROP limited, which could already be the case on G80 and
would have become even more of a problem with a higher memory/core
clock ratio. It might also initially seem odd that FP16 is also
supported at full-speed despite being certainly bandwidth limited, but
remember that full-speed FP16 also means that 32-bit floating point
pixels made up of three FP10 channels for colour and 2 bits for alpha
also go faster for free and that's not easy to do otherwise.
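
The ROP arithmetic above can be sketched in a few lines of Python. The
constants come straight from the text; the function names are ours, and
the blend figure is a simple per-clock count that ignores bandwidth
limits entirely.

```python
CHANNEL_WIDTH_BITS = 64     # one memory channel per ROP partition
ROPS_PER_PARTITION = 4      # a quartet of ROPs in each partition


def rop_config(bus_width_bits):
    """Return (partitions, rops) for a given external memory bus width."""
    partitions = bus_width_bits // CHANNEL_WIDTH_BITS
    return partitions, partitions * ROPS_PER_PARTITION


def blends_per_clock(rops, cycles_per_blend):
    """Peak INT8/FP16 channel blends per clock, ignoring memory bandwidth."""
    return rops / cycles_per_blend


if __name__ == "__main__":
    gt200_partitions, gt200_rops = rop_config(512)
    g80_partitions, g80_rops = rop_config(384)
    print("GT200:", gt200_partitions, gt200_rops, blends_per_clock(gt200_rops, 1))
    print("G80:  ", g80_partitions, g80_rops, blends_per_clock(g80_rops, 2))
```

That gives 8 partitions and 32 ROPs for GT200 against 6 and 24 for G80,
and a peak of 32 blends per clock against 12.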

The ROP partitions talk to GDDR3 memory only in GT200. We mention that
in passing since it affects how the architecture works due to burst
length, where you need to be sure to match what the DRAM wants every
time you feed it or ask for data in any given clock cycle, especially
when sampling. GDDR4 support seems non-existent, and we're certain
there's no GDDR5 support in the physical interface (PHY) either. The
number of ROP partitions means that with suitably fast memory, GT200
easily joins that exclusive club of microprocessors with more than
100GB/sec to their external DRAM devices. No other class of processor
in consumer computing enjoys that at the time of writing.

The ROP also improves on peak compression performance compared to both
G80 and G92, allowing it to do more with the available memory
bandwidth, not that 512-bit and fast graphics DRAMs mean there's a
lack of the stuff available to GT200-based SKUs, more on which later.
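
For a feel of what "suitably fast memory" means for the 100GB/sec
figure, here is a small Python sketch. It takes 1GB as 10^9 bytes and
works in transfers per second per pin; the 2.0GT/sec data rate in the
example is a hypothetical value chosen for round numbers, not a quoted
product spec, and the function names are ours.

```python
BUS_WIDTH_BITS = 512


def bandwidth_gb_per_sec(transfers_per_sec, bus_width_bits=BUS_WIDTH_BITS):
    """Peak bandwidth in GB/sec (1GB = 1e9 bytes) for a given data rate."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9


def min_data_rate_for(target_gb_per_sec, bus_width_bits=BUS_WIDTH_BITS):
    """Transfers per second per pin needed to reach the target bandwidth."""
    return target_gb_per_sec * 1e9 / (bus_width_bits / 8)


if __name__ == "__main__":
    print(min_data_rate_for(100) / 1e9)         # GT/sec needed on a 512-bit bus
    print(min_data_rate_for(100, 448) / 1e9)    # GT/sec needed on a 448-bit bus
    print(bandwidth_gb_per_sec(2.0e9))          # hypothetical 2.0GT/sec memory
```

On the full 512-bit bus, anything faster than roughly 1.56GT/sec per
pin clears 100GB/sec.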

That's largely it in terms of the chip's new or changed architectural
traits in a basic sense. The questions posed now mostly become ones of
scheduling changes, and how memory access differs when compared to
prior implementations of the same basic architecture in the G8x and
G9x family of GPUs.




GT200: General Architecture Notes

We mentioned that the big questions posed now mostly become ones of
scheduling changes, and how memory access differs when compared to
prior implementations of the same basic architecture in the G8x and
G9x family of GPUs. Where it concerns the former question, it becomes
prudent to wonder whether the 'missing' MUL is finally available for
general shading (along with the revelation about its inclusion in G8x
and G9x, which we might one day share).

We've been able to verify freer issue of the instruction in general
shading, but not near the theoretical peak when the chip is executing
graphics codes. NVIDIA mention improvement to register allocation and
scheduling as the reason behind the freer execution of the MUL, and we
believe them. However it looks likely that it's only able to retire a
result every second clock because of operand fetch in graphics mode,
effectively halving its throughput. In CUDA mode, operand fetch seems
more flexible, with throughput nearer peak, although we've not spent
enough time with the hardware yet to really be perfectly sure.
Regardless, at this point it seems impossible to extract the peak
figure of 933Gflops FP32 with our in-house graphics codes. How much
this matters depends on whether you can use the MUL implicitly through
attribute interpolation the rest of the time, which we aren't sure
about just yet either.
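
To show where the 933Gflops figure comes from and what a half-rate MUL
costs, here is a hedged Python sketch. It assumes 240 FP32 ALUs (10
clusters of 3 SMs of 8 ALUs) and the usual convention of counting a MAD
as 2 flops and a MUL as 1; the hot clock is derived from the peak
figure rather than quoted, and the function names are ours.

```python
FP32_ALUS = 240       # assumption: 10 clusters x 3 SMs x 8 ALUs
FLOPS_PER_MAD = 2     # convention: multiply-add counts as two flops
FLOPS_PER_MUL = 1
PEAK_GFLOPS = 933     # peak FP32 figure quoted in the article


def implied_hot_clock_mhz(peak_gflops=PEAK_GFLOPS):
    """Hot clock implied by the peak figure with MAD+MUL issued every clock."""
    return peak_gflops * 1e3 / (FP32_ALUS * (FLOPS_PER_MAD + FLOPS_PER_MUL))


def gflops(hot_clock_mhz, mul_rate):
    """FP32 rate with the MUL retiring a result on a fraction of clocks."""
    per_clock = FP32_ALUS * (FLOPS_PER_MAD + FLOPS_PER_MUL * mul_rate)
    return per_clock * hot_clock_mhz / 1e3


if __name__ == "__main__":
    clock = implied_hot_clock_mhz()
    print(round(clock, 2))               # implied hot clock in MHz
    print(round(gflops(clock, 1.0), 1))  # MUL every clock
    print(round(gflops(clock, 0.5), 1))  # MUL every second clock
    print(round(gflops(clock, 0.0), 1))  # MAD only
```

With the MUL retiring every second clock, the same arithmetic lands at
about 777.5Gflops, and MAD alone gives 622Gflops.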

After that it's probably best to worry about GS performance in D3D10
graphical applications, which we'll do when it comes time to benchmark
the hardware. The new output buffer size increase is one of the bigger
architectural differences, maybe even more so than the addition of the
extra SM per cluster. Adoption of the GS stage in the D3D10 pipe has
undoubtedly been held back a little by the typical NVIDIA tactic of
building just enough in silicon to make a feature work, but building
too little to make it immediately useful.

The increase in register file, a doubling over the number of per-SM
registers available to G8x and G9x chips, means that there's less
pressure for the chip to decrease the number of possible in-flight
threads, letting latency hiding from the sampler hardware (it's the
same 200+ cycles latency to DRAM as with G80 from the core clock's
point of view) become more effective than it ever has done in the past
with this architecture. Performance becomes freer and easier in other
words, the schedulers more able to keep the cluster busy under heavy
shading loads. Developers now need to worry less about their
utilisation of the chip, not that we guess many really were with G80
and G92. The other G8x and G9x parts have different performance traits
for a developer to consider there, given how NVIDIA (annoyingly in the
low-end from a developer perspective) scaled them down from the
grandfather parts.

That per-SM shared memory didn't increase is interesting too. The way
the CUDA programming model works means that a static shared memory
size across generations is attractive for the application developer.
He or she doesn't have to tweak their codes too much to make the best
use of GT200, given that shared memory size didn't change. However
given that CUDA codes will have to be rewritten for GT200 anyway if
the application developer wants to make serious use of FP64
support.... ah, but that's comparatively slow in GT200, and heck,
16KiB for every SM is a fair aggregate chunk of SRAM when multiplied
out across the whole chip. 1.4B transistors sounds like room to
breathe, but we doubt NVIDIA see it as an excuse to be so blasé about
on-chip SRAM pools, even if they are inherently redundant parts of the
chip which will help yields of the beast.

Minor additional notes about the processing architecture include
improvements to how the input assembler can communicate with the DRAM
devices through the memory crossbar, allowing more efficient indexing
into memory contents when fetching primitive data, and a larger post-
transform cache to help feed the rasteriser a bit better. Primitive
setup rate is unchanged, which is a little disappointing given how
much you can be limited there during certain common and intensive
graphics operations. Assuming there's no catch, that fixed setup rate
is likely one of the big reasons why performance improvements over G80
are more impressive at ultra-high-end resolutions (along with the
improved bilinear filtering and ALU performance which also become more
important there).


GT200: Thoughts on positioning and the NVIO Display Pipe

It's easy enough to be blasé as the writer talking about the
architecture. Here's hoping the differences present don't add up to
conclusions of “it's just a wider G80” in the technical press. It's a
bit more than that, when surfaces are scratched (and sampled and
filtered, since we're talking about graphics).

The raw numbers do tell a tale, though, and it's no small piece of
silicon even in 55nm form as a 'GT200b'. In fact, it's easily the
biggest single piece of silicon ever sold to the general PC-buying
populace, and we're confident it'll hold that crown until well into
2009. When writing about GT200 I've found my mind wandering to that
horribly cheesy analogy that everyone loves to read about from the
linguistically-challenged technical writer. What do I compare it to
that everyone will recognise, that does it justice? I can't help but
imagine the Cloverfield monster wearing a dainty pair of pink
ballerina shoes, as it destroys everything in the run to the end game.
Elegant brawn, or something like that. You know what I mean. That also
means I get to wonder out loud and ask if ATI are ready to execute the
Hammer Down protocol.

It'll need to if it wants to conquer a product stack that'll see
NVIDIA make use of cheap G92 and G92b (55nm) based products underneath
the GT200-based models it's introducing today. That leads us on nicely
to talking about how NVIDIA can scale GT200 in order to have it drive
multiple products scaled not just in clock, but in enabled unit count.

GT200 is able to be scaled in terms of active cluster count and the
number of active ROP partitions, at a basic level. At a more advanced
level, the FP64 ALU is freely removed, and we fully assume that to be
the case for lower-end derivatives. For this chip though, it follows
the same redundancy and product scaling model that we famously saw
with G80 and then G92. So initially, we'll see a product based on the
full configuration of 10 clusters and 8 ROP partitions, with the full
512-bit external memory bus that brings. Along with that there'll be
an 8 cluster model with 448-bit memory interface (so a single ROP
partition disabled there). Nothing exciting then, and what one would
reasonably expect given the past history of chips with the same basic
architecture.
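
The two launch configurations can be written down as a small Python
sketch, deriving the headline numbers from cluster and ROP partition
counts. The class and constant names are hypothetical, chosen for
illustration; the per-unit figures are the ones given earlier in the
piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GT200Config:
    """A GT200 variant described by its enabled units (illustrative only)."""
    clusters: int
    rop_partitions: int

    @property
    def fp32_alus(self):
        return self.clusters * 3 * 8      # 3 SMs per cluster, 8 ALUs per SM

    @property
    def rops(self):
        return self.rop_partitions * 4    # a quartet of ROPs per partition

    @property
    def bus_width_bits(self):
        return self.rop_partitions * 64   # one 64-bit channel per partition


FULL = GT200Config(clusters=10, rop_partitions=8)
CUT_DOWN = GT200Config(clusters=8, rop_partitions=7)

if __name__ == "__main__":
    for config in (FULL, CUT_DOWN):
        print(config, config.fp32_alus, config.rops, config.bus_width_bits)
```

Disabling one ROP partition is what takes the bus from 512 to 448 bits,
and it takes four ROPs with it.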

Display Pipe

We've tacked it on to the back end of the architecture discussion, but
it's worth mentioning because of how it's manifest in hardware. So as
far as the display pipe goes, you've got the same 10bpc support as
G80, and it's via NVIO (a new revision) again this time. The video
engine is almost a direct cut and paste from G84, G92 et al, so we get
to call it VP2 and grumble under our breath about the overall state of
PC HD video in the wake of HD DVD losing out to BluRay. It's based on
Tensilica IP (just like AMD's UVD), NVIDIA using the company's area-
efficient DSP cores to create the guts of the video decode hardware,
with the shader core used to improve video quality rather than assist
in the decode process. The chip supports a full range of analogue and
digital display outputs, including HDMI with HDCP protection, as you'd
expect from a graphics product in the middle of 2008.

To portend to DisplayPort port support.... it's possible, but that's
up to the board vendor and whether they want to use an external
transmitter. Portunately they can.
From: Tim O on

Your copy and post pastes direct from web pages are so helpful to
people that don't know how to use a web browser!

Here's a great article I found on bowling!
http://www.articlesbase.com/sports-and-fitness-articles/further-enhancing-your-bowling-strategies-313261.html

Further Enhancing Your Bowling Strategies
Author: Jimmy Cox

The general style of the advanced bowler is already set. Below are
listed pointers eliminating faults, increasing speed and handling
spares. These are a great start to improving your game!

It might be well to point out right here that any change in one's
style almost automatically means a temporary drop in average. For
instance, if you decide to change your footwork, you might as well
face the fact that you will lose points while correcting yourself.

The important thing to remember, if and when you are satisfied in your
own mind that you are doing something fundamentally wrong, is that by
correcting the fault you will bring your average up higher than it
was.

The best time to do this correction work or practice is in the
summertime, when your experiments will not be at the expense of your
teammates. During this period, you have three or four months to work
out those kinks and to incorporate into your style the correct methods
you failed to use previously.

One fault leads to another.

It is an axiom of bowling that one key fault can cause two or three
other faults. Suppose a bowler takes his first step too fast.

That is the key fault, but it also results in poor timing, too fast
footwork, and being off balance at the foul line. Another key fault
might be allowing the right shoulder to be pulled back and out of
line, which brings on such other faults as improperly facing the pins,
finishing sideways at the foul line and a poor follow-through.

The key fault of lunging at the foul line ruins timing, makes the
release jerky, and may cause the bowler to hop.

Get rid of individual faults only when necessary.

You may have a particular flaw in your game, but if you do the same
thing consistently and successfully, do not change. There are bowlers
today averaging 200 who do not have a good follow-through, or who have
too high a backswing or who possess some other fault. But they have
learned to incorporate that flaw into their game so well that they are
consistent, and their game might fall apart if they attempted to
change it.

In this regard, I might point out that I am not referring here to
those bowlers who are not high average bowlers and are afraid to
change, despite the fact that they possess an obvious flaw in their
game.

There are several ways in which to increase your speed.

You might use any or all of these to succeed. Here they are:

a. Hold the ball higher in your starting position. This will help give
you a longer pendulum swing.

b. Use more pushaway when you begin. Push it farther out, if you have
been negligent in that phase.

c. Increase your backswing. Perhaps you have been bringing the ball no
higher than your waist on the backswing. Remember that you can bring
it back as high as the shoulder without violating the fundamental rule
in this regard.

d. Work on more perfect timing. Perfect timing gives you the maximum
amount of natural speed. If you have had trouble getting good speed,
perhaps you have been coming to a full stop at the foul line before
your right arm begins its swing. Perfect timing will increase your
speed and is far better for you and for your game than trying to force
the ball.

Do you play spares properly?

Here are the three rules:

a. Face your target from the correct angle. Square your shoulders to
the target.

b. Walk directly toward your target. In the cases of the 7-pin and the
10-pin, this means walking directly toward that pin, which will cause
you to go to the foul line at a slight angle.

c. Make sure that you have your right arm following through directly
toward your target. Get your right arm out to where you are looking,
whether this be a pin or a spot.

Work on the above points conscientiously and your game will improve
dramatically. Just keep going!
From: Cool on

"NV55" <nvidianv55(a)mail.com> wrote in message
news:9eb33d1e-cacc-4f5b-895f-d0a54f265008(a)j22g2000hsf.googlegroups.com...

Yup and they can keep it. Not impressed at all.