From: NV55 on 16 Jun 2008 15:06 http://www.beyond3d.com/content/reviews/51 NVIDIA GT200 GPU and Architecture Analysis Published on 16th Jun 2008, written by Rys for Consumer Graphics - Last updated: 15th Jun 2008 Introduction Sorry G80, your time is up. There's no arguing that NVIDIA's flagship D3D10 GPU has held a reign over 3D graphics that never truly saw it usurped, even by G92 and a dubiously named GeForce 9-series range. The high-end launch product based on G80, GeForce 8800 GTX, is still within spitting distance of anything that's come out since in terms of raw single-chip performance. It flaunts its 8 clusters, 384-bit memory bus and 24 ROPs in the face of G92, meaning that products like 9800 GTX have never really felt like true upgrades to owners of G80-based products. That I type this text on my own PC powered by a GeForce 8800 GTX, one that I bought -- which is largely unheard of in the world of tech journalism; as a herd, we never usually buy PC components -- with my own hard-earned, and on launch day no less, speaks wonders for the chip's longevity. I'll miss you old girl, your 20 month spell at the top of the pile is now honestly up. So what chip the usurper, and how far has it moved the game on? Rumours about GT200 have swirled for some time, and recently the rumour mill has mostly got it right. The basic architecture is pretty much a known quantity at this point, and it's a basic architecture that shares a lot of common ground with the one powering the chip we've just eulogised. Why mess too much with what's worked so well, surely? "Correctamundo", says the Fonz, and the Fonz is always right. It's all about the detail now, so we'll try and reveal as much as possible to see where the deviance can be found. We'll delve into the architecture first, before taking a look at the first two products it powers, looking back to previous NVIDIA D3D10 hardware as necessary to paint the picture. 
NVIDIA GT200 Overview The following diagram represents a high-level look at how GT200 is architected and what some of the functional units are capable of. It's a similar chip to G80, of that there's no doubt, but the silicon surgery undertaken by NVIDIA's architects to create it means we have quite a different beast when you take a look under the surface. http://www.beyond3d.com/images/reviews/gt200-arch/GT200-full-1.2-26-05-08.png If it's not clear from the above diagram, like G80, GT200 is a fully- unified, heavily-threaded, self load-balancing (full time, agnostic of API) shading architecture. It has decoupled and threaded data processing, allowing the hardware to fully realise the goal of hiding sampler latency by scheduling sampler threads independently of, and asynchronously with, shading threads. The design goals of the chip appear to be the improvement of D3D10 performance in general, especially at the Geometry Shader stage, with the end result presumably as close to doubling the performance of a similarly clocked G92 as possible. There's not 2x the raw performance available everywhere on the chip of course, but the increase in certain computation resources should see it achieve something like that in practice, depending on what's being rendered or computed. Let's look closer at the chip architecture, then. The analysis was written with our original look at G80 in mind. The architecture we discussed there is the basis for what we'll talk about today, so have a good read of that to refresh your memory, and/or ask in the forums if anything doesn't make sense. The original piece is a little outdated in places, as we've discovered more about the chip as time goes by over the last year and a half, so just ask about or let us know about something that doesn't quite fit. 
GT200: The Shading Core http://www.beyond3d.com/images/reviews/gt200-arch/shader-core.png GT200 demonstrates subtle yet distinct architectural differences when compared to G80, the chip that pioneered the basic traits of this generation of GPUs from Kirk and Co. As we've alluded to, G80 led a family of chips that have underpinned the company's dominance over AMD in the graphics space since its launch, so it's no surprise to see NVIDIA stick to the same themes of execution, use of on-chip memories, and approach to acceleration of graphics and non-graphics computation. At its core, GT200 is a MIMD array of SIMD processors, partitioned into what we call clusters, with each cluster a 3-way collection of shader processors which we call an SM. Each SM, or streaming multiprocessor, comprises 8 scalar ALUs, with each capable of FP32 and 32-bit integer computation (the only exception being multiplication, which is INT24 and therefore still takes 4 cycles for INT32), a single 64-bit ALU for brand new FP64 support, and a discrete pool of shared memory 16KiB in size. The FP64 ALU is notable not just in its inclusion, NVIDIA supporting 64-bit computation for the first time in one of its graphics processors, but in its ability. It's capable of a double precision MAD (or MUL or ADD) per clock, supports 32-bit integer computation, and somewhat surprisingly, signalling of a denorm at full speed with no cycle penalty, something you won't see in any other DP processor readily available (such as any x86 or Cell). The ALU uses the MAD to accelerate software support for specials and divides, where possible. Those ALUs are paired with another per-SM block of computation units, just like G80, which provide scalar interpolation of attributes for shading and a single FP-only MUL ALU. That lets each SM potentially dual-issue 8 MAD+MUL instruction pairs per clock for general shading, with the MUL also assisting in attribute setup when required. 
However, as you'll see, that dual-issue performance depends heavily on input operand bandwidth. Each warp of threads still runs for four clocks per SM, with up to 1024 threads managed per SM by the scheduler (which has knock-on effects for the programmer when thinking about thread blocks per cluster). The hardware still scales back threads in flight if there's register pressure of course, but that's going to happen less now the RF has doubled in size per SM (and it might happen more gracefully now to boot). So, along with that pool of shared memory is connection to a per-SM register file comprising 16384 32-bit registers, double that available for each SM in G80. Each SP in each SM runs the same instruction per clock as the others, but each SM in a cluster can run its own instruction. Therefore in any given cycle, SMs in a cluster are potentially executing a different instruction in a shader program in SIMD fashion. That goes for the FP64 ALU per SM too, which could execute at the same time as the FP32 units, but it shares datapaths to the RF, shared memory pools, and scheduling hardware with them so the two can't go full-on at the same time (presumably it takes the place of the MUL/SFU, but perhaps it's more flexible than that). Either way, it's not currently exposed outside of CUDA or used to boost FP32 performance. That covers basic execution across a cluster using its own memory pools. Across the shader core, each SM in each cluster is able to run a different instruction for a shader program, giving each SM its own program counter, scheduling resources, and discrete register file block. A processing thread started on one cluster can never execute on any other, although another thread can take its place every cycle. The SM schedulers implement execution scoreboarding and are fed from the global scheduler and per thread-type setup engines, one for VS, one for GS and one for PS threads.
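The per-SM figures above (8 scalar ALUs, four clocks per warp, 1024 threads, 16384 registers) fit together with some simple arithmetic. What follows is a minimal sketch of that arithmetic, not NVIDIA's actual allocation logic: the class and function names are made up for illustration, registers are assumed to be handed out per thread with no allocation granularity, and the half-size register file configuration keeps the 1024-thread cap purely so the two cases can be compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMConfig:
    """Per-SM figures quoted in the article; the class itself is illustrative."""
    scalar_alus: int = 8       # FP32/INT ALUs (SPs) per SM
    clocks_per_warp: int = 4   # a warp occupies the SM for four clocks
    max_threads: int = 1024    # threads the SM scheduler can manage
    registers: int = 16384     # 32-bit registers in the per-SM register file


def warp_size(sm: SMConfig) -> int:
    """Threads per warp: one thread per ALU lane for each clock the warp runs."""
    return sm.scalar_alus * sm.clocks_per_warp


def regs_per_thread_at_full_occupancy(sm: SMConfig) -> int:
    """Register budget per thread if every schedulable thread is resident."""
    return sm.registers // sm.max_threads


def threads_in_flight(sm: SMConfig, regs_per_thread: int) -> int:
    """Simplified model of register pressure scaling back threads in flight.

    Assumption: registers are allocated per thread and the result is rounded
    down to whole warps. Real hardware allocates at a coarser granularity.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    limit = min(sm.max_threads, sm.registers // regs_per_thread)
    w = warp_size(sm)
    return (limit // w) * w


GT200_SM = SMConfig()
# Hypothetical comparison point: a register file half the size, as the article
# says G80 had per SM. The 1024-thread cap is kept only for comparison.
HALF_RF_SM = SMConfig(registers=8192)

if __name__ == "__main__":
    print("warp size:", warp_size(GT200_SM))
    print("regs/thread at full occupancy:",
          regs_per_thread_at_full_occupancy(GT200_SM))
    for regs in (10, 16, 20, 32):
        print(regs, "regs/thread ->",
              threads_in_flight(GT200_SM, regs), "threads with 16384 regs,",
              threads_in_flight(HALF_RF_SM, regs), "with 8192 regs")
```

Under this model a shader needing 32 registers per thread keeps 512 threads resident with the doubled register file and 256 with the half-size one, which is the sense in which register pressure bites less often on GT200.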
From: NV55 on 16 Jun 2008 15:11 GT200: Sampling and the ROP http://www.beyond3d.com/images/reviews/gt200-arch/tpc.png For data fetch and filtering, each cluster is connected to its own discrete sampler unit (with cluster + samplers called the texture processing cluster or TPC by NVIDIA), with each one able to calculate 8 sample addresses and bilinearly filter 8 samples per clock. That's unchanged compared to G92, but it's worth pointing out that prior hardware could never reach the bilinear peak outside of (strangely enough) scalar FP32 textures. It's now obtainable (or at least much closer) thanks to, according to NVIDIA, tweaks to the thread scheduler and sampler I/O. We still heavily suspect though that one of the key reasons is additional shared INT16 hardware for what we imagine actually is a shared addressing/filtering unit. Either way, each sampler has a dedicated L1 cache which is likely 16KiB and all sampler units share a global L2 cache that we believe is double the size of that in G80 at 256KiB. The sampler hardware runs at the chip base clock, whereas the shading units run at the chip hot clock, which is most easily thought of as being 2x the scheduler clock. Along with the memory clock, those mentioned clocks comprise the main domains in GT200, just like they did in G80. The hardware is advertised as supporting D3D10.0, since its architecture is marginally incapable of supporting 10.1, by virtue of the ROP hardware. D3D10 compliance means the ability in hardware for recycling data from GS stage of the computation model back through the chip for another pass. The output buffer for that is six times larger in GT200 than in G80, although NVIDIA don't disclose the exact size. Given that the GS stage is capable of data amplification (and de- amplification of course), the increased buffer size represents a significant change in what the architecture is capable of in a performance sense, if not a theoretical sense. 
The same per-thread output limits are present, but more GS threads can now be run at the same time. That covers the changes to on-chip memories that each cluster has access to. Quickly returning to the front of the chip, it appears that the hardware can still only set up a single triangle per clock, and the rasteriser is largely unchanged. Remember that in G80, the rasteriser worked on 32-pixel blocks, correlating to the pixel batch size. GT200 continues to work on the same size pixel blocks as it sends the screen down through the clusters as screen tiles for shading. http://www.beyond3d.com/images/reviews/g80-arch/g80-quad-rop.png At the back of the chip, after computation via each TPC, the same basic ROP architecture as G80 is present. With the external memory bus 512 bits wide this time and each 64-bit memory channel serving a ROP partition, that means 8 ROP partitions, each partition housing a quartet of ROP units. 32 in total then. Each ROP is now capable of a full-speed INT8 or FP16 channel blend per cycle, whereas G80 needed two cycles to complete the same operations. This guarantees that blending isn't ROP limited, which could already be the case on G80 and would have become even more of a problem with a higher memory/core clock ratio. It might also initially seem odd that FP16 is also supported at full-speed despite being certainly bandwidth limited, but remember that full-speed FP16 also means that 32-bit floating point pixels made up of three FP10 channels for colour and 2 bits for alpha also go faster for free and that's not easy to do otherwise. The ROP partitions talk to GDDR3 memory only in GT200. We mention that in passing since it affects how the architecture works due to burst length, where you need to be sure to match what the DRAM wants every time you feed it or ask for data in any given clock cycle, especially when sampling. GDDR4 support seems non-existent, and we're certain there's no GDDR5 support in the physical interface (PHY) either.
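The ROP numbers quoted above follow directly from the memory bus width. As a rough sketch using only figures from the article (64-bit channels, four ROPs per partition, one blend per cycle on GT200 against two cycles on G80, and G80's 384-bit bus), the per-clock blend rate can be worked out as below. The function and constant names are illustrative, and the comparison is per ROP clock only, ignoring any difference in clock speed or memory bandwidth.

```python
CHANNEL_BITS = 64        # each 64-bit memory channel serves one ROP partition
ROPS_PER_PARTITION = 4   # "a quartet of ROP units" per partition


def rop_config(bus_bits: int, cycles_per_blend: int) -> dict:
    """Derive partition count, ROP count and peak blends per clock."""
    if bus_bits % CHANNEL_BITS != 0:
        raise ValueError("bus width must be a whole number of 64-bit channels")
    partitions = bus_bits // CHANNEL_BITS
    rops = partitions * ROPS_PER_PARTITION
    return {
        "partitions": partitions,
        "rops": rops,
        # Peak INT8/FP16 blends per clock if blending is purely ROP limited.
        "blends_per_clock": rops / cycles_per_blend,
    }


GT200 = rop_config(bus_bits=512, cycles_per_blend=1)
G80 = rop_config(bus_bits=384, cycles_per_blend=2)

if __name__ == "__main__":
    print("GT200:", GT200)
    print("G80:  ", G80)
    print("per-clock blend ratio:",
          GT200["blends_per_clock"] / G80["blends_per_clock"])
```

On those assumptions the full chip moves from 12 to 32 blends per clock, which is why the article treats blending as no longer ROP limited.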
The number of ROP partitions means that with suitably fast memory, GT200 easily joins that exclusive club of microprocessors with more than 100GB/sec to their external DRAM devices. No other class of processor in consumer computing enjoys that at the time of writing. The ROP also improves on peak compression performance compared to both G80 and G92, allowing it to do more with the available memory bandwidth, not that 512-bit and fast graphics DRAMs mean there's a lack of the stuff available to GT200-based SKUs, more on which later. That's largely it in terms of the chip's new or changed architectural traits in a basic sense. The questions posed now mostly become ones of scheduling changes, and how memory access differs when compared to prior implementations of the same basic architecture in the G8x and G9x family of GPUs.
GT200: General Architecture Notes We mentioned that the big questions posed now mostly become ones of scheduling changes, and how memory access differs when compared to prior implementations of the same basic architecture in the G8x and G9x family of GPUs. Where it concerns the former question, it becomes prudent to wonder whether the 'missing' MUL is finally available for general shading (along with the revelation about its inclusion in G8x and G9x, which we might one day share). We've been able to verify freer issue of the instruction in general shading, but not near the theoretical peak when the chip is executing graphics codes. NVIDIA mention improvement to register allocation and scheduling as the reason behind the freer execution of the MUL, and we believe them. However it looks likely that it's only able to retire a result every second clock because of operand fetch in graphics mode, effectively halving its throughput. In CUDA mode, operand fetch seems more flexible, with throughput nearer peak, although we've not spent enough time with the hardware yet to really be perfectly sure.
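To put a number on what a half-rate MUL costs, here is a back-of-envelope sketch. It assumes the article's 933Gflops FP32 peak counts a MAD (2 flops) plus a MUL (1 flop) on every SP every hot clock, and uses the article's 10 clusters of 3 SMs with 8 SPs each. The hot clock it prints is implied by those assumptions rather than quoted from anywhere, and the names are illustrative.

```python
CLUSTERS = 10          # full GT200 configuration, per the article
SMS_PER_CLUSTER = 3
SPS_PER_SM = 8
MAD_FLOPS = 2          # multiply-add counts as two floating point ops
MUL_FLOPS = 1
QUOTED_PEAK_FLOPS = 933e9   # the article's FP32 peak figure


def flops_per_hot_clock(mul_issue_rate: float) -> float:
    """Chip-wide FP32 flops per hot clock.

    mul_issue_rate is the fraction of clocks on which the extra MUL retires
    a result: 1.0 for full dual-issue, 0.5 for every second clock, 0.0 for
    MAD only.
    """
    sps = CLUSTERS * SMS_PER_CLUSTER * SPS_PER_SM
    return sps * (MAD_FLOPS + MUL_FLOPS * mul_issue_rate)


# Assumption: the quoted peak is MAD+MUL on every SP, every clock.
implied_hot_clock_hz = QUOTED_PEAK_FLOPS / flops_per_hot_clock(1.0)


def gflops(mul_issue_rate: float) -> float:
    """Throughput in Gflops at the implied hot clock."""
    return flops_per_hot_clock(mul_issue_rate) * implied_hot_clock_hz / 1e9


if __name__ == "__main__":
    print("implied hot clock (MHz):", round(implied_hot_clock_hz / 1e6))
    for label, rate in (("MAD only", 0.0), ("half-rate MUL", 0.5),
                        ("full MAD+MUL", 1.0)):
        print(label, "->", round(gflops(rate), 1), "Gflops")
```

In this model a MUL that retires every second clock leaves the chip at five sixths of the quoted peak, with MAD-only code at two thirds.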
Regardless, at this point it seems impossible to extract the peak figure of 933Gflops FP32 with our in-house graphics codes. How much this matters depends on whether you can use the MUL implicitly through attribute interpolation the rest of the time, which we aren't sure about just yet either. After that it's probably best to worry about GS performance in D3D10 graphical applications, which we'll do when it comes time to benchmark the hardware. The new output buffer size increase is one of the bigger architectural differences, maybe even more so than the addition of the extra SM per cluster. Adoption of the GS stage in the D3D10 pipe has undoubtedly been held back a little by the typical NVIDIA tactic of building just enough in silicon to make a feature work, but building too little to make it immediately useful. The increase in register file, a doubling over the number of per-SM registers available to G8x and G9x chips, means that there's less pressure for the chip to decrease the number of possible in-flight threads, letting latency hiding from the sampler hardware (it's the same 200+ cycles latency to DRAM as with G80 from the core clock's point of view) become more effective than it ever has done in the past with this architecture. Performance becomes freer and easier in other words, the schedulers more able to keep the cluster busy under heavy shading loads. Developers now need to worry less about their utilisation of the chip, not that we guess many really were with G80 and G92. The other G8x and G9x parts have different performance traits for a developer to consider there, given how NVIDIA (annoyingly in the low-end from a developer perspective) scaled them down from the grandfather parts. That per-SM shared memory didn't increase is interesting too. The way the CUDA programming model works means that a static shared memory size across generations is attractive for the application developer. 
He or she doesn't have to tweak their codes too much to make the best use of GT200, given that shared memory size didn't change. However given that CUDA codes will have to be rewritten for GT200 anyway if the application developer wants to make serious use of FP64 support.... ah, but that's comparatively slow in GT200, and heck, 16KiB for every SM is a fair aggregate chunk of SRAM when multiplied out across the whole chip. 1.4B transistors sounds like room to breathe, but we doubt NVIDIA see it as an excuse to be so blasé about on-chip SRAM pools, even if they are inherently redundant parts of the chip which will help yields of the beast. Minor additional notes about the processing architecture include improvements to how the input assembler can communicate with the DRAM devices through the memory crossbar, allowing more efficient indexing into memory contents when fetching primitive data, and a larger post-transform cache to help feed the rasteriser a bit better. Primitive setup rate is unchanged, which is a little disappointing given how much you can be limited there during certain common and intensive graphics operations. Assuming there's no catch, this is likely one of the big reasons why performance improvements over G80 are more impressive at ultra-high-end resolutions (along with the improved bilinear filtering and ALU performance which also become more important there).
GT200: Thoughts on positioning and the NVIO Display Pipe It's easy enough to be blasé as the writer talking about the architecture. Here's hoping the differences present don't add up to conclusions of it's just a wider G80 in the technical press. It's a bit more than that, when surfaces are scratched (and sampled and filtered, since we're talking about graphics). The raw numbers do tell a tale, though, and it's no small piece of silicon even in 55nm form as a 'GT200b'.
In fact, it's easily the biggest single piece of silicon ever sold to the general PC-buying populace, and we're confident it'll hold that crown until well into 2009. When writing about GT200 I've found my mind wandering to that horribly cheesy analogy that everyone loves to read about from the linguistically-challenged technical writer. What do I compare it to that everyone will recognise, that does it justice? I can't help but imagine the Cloverfield monster wearing a dainty pair of pink ballerina shoes, as it destroys everything in the run to the end game. Elegant brawn, or something like that. You know what I mean. That also means I get to wonder out loud and ask if ATI are ready to execute the Hammer Down protocol. It'll need to if it wants to conquer a product stack that'll see NVIDIA make use of cheap G92 and G92b (55nm) based products underneath the GT200-based models it's introducing today. That leads us on nicely to talking about how NVIDIA can scale GT200 in order to have it drive multiple products scaled not just in clock, but in enabled unit count. GT200 is able to be scaled in terms of active cluster count and the number of active ROP partitions, at a basic level. At a more advanced level, the FP64 ALU is freely removed, and we fully assume that to be the case for lower-end derivatives. For this chip though, it follows the same redundancy and product scaling model that we famously saw with G80 and then G92. So initially, we'll see a product based on the full configuration of 10 clusters and 8 ROP partitions, with the full 512-bit external memory bus that brings. Along with that there'll be an 8 cluster model with 448-bit memory interface (so a single ROP partition disabled there). Nothing exciting then, and what one would reasonably expect given the past history of chips with the same basic architecture.
Display Pipe We've tacked it on to the back end of the architecture discussion, but it's worth mentioning because of how it's manifest in hardware.
So as far as the display pipe goes, you've got the same 10bpc support as G80, and it's via NVIO (a new revision) again this time. The video engine is almost a direct cut and paste from G84, G92 et al, so we get to call it VP2 and grumble under our breath about the overall state of PC HD video in the wake of HD DVD losing out to BluRay. It's based on Tensilica IP (just like AMD's UVD), NVIDIA using the company's area-efficient DSP cores to create the guts of the video decode hardware, with the shader core used to improve video quality rather than assist in the decode process. The chip supports a full range of analogue and digital display outputs, including HDMI with HDCP protection, as you'd expect from a graphics product in the middle of 2008. To portend to DisplayPort port support.... it's possible, but that's up to the board vendor and whether they want to use an external transmitter. Portunately they can.
From: Tim O on 16 Jun 2008 15:26 Your copy and post pastes direct from web pages are so helpful to people that don't know how to use a web browser! Heres a great article I found on bowling! http://www.articlesbase.com/sports-and-fitness-articles/further-enhancing-your-bowling-strategies-313261.html Further Enhancing Your Bowling Strategies Author: Jimmy Cox The general style of the advanced bowler is already set. Below are listed pointers eliminating faults, increasing speed and handling spares. These are a great start to improving your game! It might be well to point out right here that any change in one's style almost automatically means a temporary drop in average. For instance, if you decide to change your footwork, you might as well face the fact that you will lose points while correcting yourself. The important thing to remember, if and when you are satisfied in your own mind that you are doing something fundamentally wrong, is that by correcting the fault you will bring your average up higher than it was. The best time to do this correction work or practice is in the summertime, when your experiments will not be at the expense of your teammates. During this period, you have three or four months to work out those kinks and to incorporate into your style the correct methods you failed to use previously. One fault leads to another. It is an axiom of bowling that one key fault can cause two or three other faults. Suppose a bowler takes his first step too fast. That is the key fault, but it also results in poor timing, too fast footwork, and being off balance at the foul line. Another key fault might be allowing the right shoulder to be pulled back and out of line, which brings on such other faults as improperly facing the pins, finishing sideways at the foul line and a poor follow-through. The key fault of lunging at the foul line ruins timing, makes the release jerky, and may cause the bowler to hop. Get rid of individual faults only when necessary. 
You may have a particular flaw in your game, but if you do the same thing consistently and successfully, do not change. There are bowlers today averaging 200 who do not have a good follow-through, or who have too high a backswing or who possess some other fault. But they have learned to incorporate that flaw into their game so well that they are consistent, and their game might fall apart if they attempted to change it. In this regard, I might point out that I am not referring here to those bowlers who are not high average bowlers and are afraid to change, despite the fact that they possess an obvious flaw in their game. There are several ways in which to increase your speed. You might use any or all of these to succeed. Here they are: a. Hold the ball higher in your starting position. This will help give you a longer pendulum swing. b. Use more pushaway when you begin. Push it farther out, if you have been negligent in that phase. c. Increase your backswing. Perhaps you have been bringing the ball no higher than your waist on the backswing. Remember that you can bring it back as high as the shoulder without violating the fundamental rule in this regard. d. Work on more perfect timing. Perfect timing gives you the maximum amount of natural speed. If you have had trouble getting good speed, perhaps you have been coming to a full stop at the foul line before your right arm begins its swing. Perfect timing will increase your speed and is far better for you and for your game than trying to force the ball. Do you play spares properly? Here are the three rules: a. Face your target from the correct angle. Square your shoulders to the target. b. Walk directly toward your target. In the cases of the 7-pin and the 10-pin, this means walking directly toward that pin, which will cause you to go to the foul line at a slight angle. c. Make sure that you have your right arm following through directly toward your target. 
Get your right arm out to where you are looking, whether this be a pin or a spot. Work on the above points conscientiously and your game will improve dramatically. Just keep going!
From: Cool on 19 Jun 2008 03:17 "NV55" <nvidianv55(a)mail.com> wrote in message news:9eb33d1e-cacc-4f5b-895f-d0a54f265008(a)j22g2000hsf.googlegroups.com... Yup and they can keep it. Not impressed at all.