NVIDIA's 1.4 Billion Transistor GPU: GT200 Arrives as the GeForce GTX 280 & 260
by Anand Lal Shimpi & Derek Wilson on June 16, 2008 9:00 AM EST- Posted in
- GPUs
Building NVIDIA's GT200
Here's a Streaming Processor, NVIDIA calls it an SP:
NVIDIA calls an individual SP a single processing core, which is actually true. It is a fully pipelined, single-issue, in-order microprocessor complete with two ALUs and a FPU. An SP doesn't have any cache, so it's not particularly great at anything other than cranking through tons of mathematical operations. Since an SP spends most of its time working on pixel or vertex data, the fact that it doesn't have a cache doesn't really matter. Aside from name similarities, one NVIDIA SP is a lot like a very simplified version of a SPE in the Cell microprocessor (or maybe the SPE is like a really simple version of one of NVIDIA's SMs, which we'll get to in a minute). While a single SPE in Cell has seven execution units, a single NVIDIA SP only has three.
By itself a SP is fairly useless, but NVIDIA builds GPUs and if you add up enough of these little monsters you can start to get something productive given that graphics rendering is a highly parallelizable task.
Here's a Streaming Multiprocessor, which NVIDIA abbreviates as SM:
A SM is an array of SPs, eight to be specific, along with two more processors called Special Function Units (SFUs). Each SFU has four FP multiply units which are used for transcendental operations (e.g. sin, cosin) and interpolation, the latter being used in some of the calculations for things like anisotropic texture filtering. Although NVIDIA isn't specific in saying so, we assume that each SFU is also a fully pipelined, single-issue, in-order microprocessor. There's a MT issue unit that dispatches instructions to all of the SPs and SFUs in the group.
In addition to the processor cores in a SM, there's a very small instruction cache, a read only data cache and a 16KB read/write shared memory. These cache sizes are kept purposefully small because unlike a conventional desktop microprocessor, the datasets we're trying to cache here are small. Each SP ends up working on an individual pixel and despite the move to 32-bit floating point values, there's only so much data associated with a single pixel. The 16KB memory is akin to Cell's local stores in that it's not a cache, but a software-managed data store so that latency is always predictable. With this many cores in a single SM, control and predictability and very important to making the whole thing work efficiently.
Take one more step back and you've got a Texture/Processor Cluster (TPC):
The G80/G92 TPC (left) vs. the GT200 TPC (right)
NVIDIA purposefully designed its GPU architecture to be modular, so a single TPC can be made up of any number of SMs. In the G80 architecture it was made up of two SMs but with the GT200 architecture it now has three SMs.
The components of the TPC however haven't changed; a TPC is made up of SMs, some control logic and a texture block. Remember that a SM is a total of 8 SPs and 2 SFUs, so that brings the total up to 24 SPs and 6 SFUs (must...not...type...STFU) per cluster in GT200 (up from 16 SPs and 4 SFUs in G80). The texture block includes texture addressing and filtering logic as well as a L1 texture cache.
The modular theme continues with the Streaming Processor Array (SPA) that is composed of a number of TPCs:
The GT200 SPA, that's 240 SPs in there if you want to count them
In G80 the SPA was made up of 8 TPCs, but with GT200 we've moved up to 10. Note that each TPC now has 3 SMs vs. 2, so the overall processing power of GT200 has increased by 87.5% over G80.
And here's G80/G92, only 128 SPs thanks to two SMs per TPC and 8 TPCs
At the front end of the GPU we've got schedulers and control logic to distribute workloads to the entire array of processing cores. At the other end we've got L2 texture caches and rasterization processors that handle final filtering and output of data to the frame buffer.
The culmination of all of this is that the new GT200 GPU, the heart of the GeForce GTX 280 and 260, features 240 SPs, 160KB of local memory, an even smaller amount of cache and is built on TSMC's 65nm process using 1.4 billion transistors.
1.4 Billion Transistors. It wants vertex data. Really bad.
754 Million Transistors
There are more transistors in this chip than there are people in China, and it's the largest, most compute-dense chip we've ever reviewed.
108 Comments
View All Comments
gigahertz20 - Monday, June 16, 2008 - link
I think these ridiculous prices and lackluster performance is just a way for them to sell more SLI motherboards, who would buy a $650 GTX 280 when you can buy two 8800GT's with a SLI mobo and get better performance? Especially now that the 8800GT's are approaching around $150.crimson117 - Monday, June 16, 2008 - link
It's only worth riding the bleeding edge when you can afford to stay there with every release. Otherwise, 12 months down the line, you have no budget left for an upgrade, while everyone else is buying new $200 cards that beat your old $600 card.So yeah you can buy an 8800GT or two right now, and you and me should probably do just that! But Richie Rich will be buying 2x GTX 280's, and by the time we could afford even one of those, he'll already have ordered a pair of whatever $600 cards are coming out next.
7Enigma - Tuesday, June 17, 2008 - link
Nope, the majority of these cards go to Alienware/Falcon/etc. top of the line, overpriced pre-built systems. These are for the people that blow $5k on a system every couple years, don't upgrade, might not even seriously game, they just want the best TODAY.They are the ones that blindly check the bottom box in every configuration for the "fastest" computer money can buy.
gigahertz20 - Monday, June 16, 2008 - link
Very few people are richie rich and stay at the bleeding edge. People that are very wealthy tend not to be computer geeks and purchase their computers from Dell and what not. I'd say at least 96% of gamers out there are value oriented, these $650 cards will not sell much at all. If anything, you'll see people claim to have bought one or two of these in forums and other places, but their just lying.perzy - Monday, June 16, 2008 - link
Well I for one is waiting for Larabee. Maybee (probably) it isen' all that its cranked up to be, but I want to see.And what about some real powersaving Nvidia?
can - Monday, June 16, 2008 - link
I wonder if you can just flash the BIOS of the 260 to get it to operate as if it were a 280...7Enigma - Tuesday, June 17, 2008 - link
You haven't been able to do this for a long time....they learned their lessons the hard way. :)Nighteye2 - Monday, June 16, 2008 - link
Is it just me, or does this focus on compute power mean Nvidia is starting to get serious about using the GPU for physics, as well as graphics? It's also in-line with the Ageia acquisition.will889 - Monday, June 16, 2008 - link
At the point where NV has actually managed to position SLI mobos and GPU's where you actually need that much power to get decent FPS (above 30 average) from games gaming on the PC will be entirely dead to all those but the most esoteric. It would be different if there were any games worth playing or as many games as the console brethren have. I thought GPU's/cases/power supplies were supposed to become more efficient? EG smaller but faster sort of how the TV industry made TV's bigger yet smaller in footprint with way more features - not towering cases with 1200KW PSU's and 2X GTX 280 GPU's? All this in the face o drastically raised gas prices?Wanna impress me? How about a single GPU with the PCB size of a 7600GT/GS that's 15-25% faster than a 9800GTX that can fit into a SFF case? needing a small power supply AND able to run passively @ moderate temps. THAT would be impressive. No, Seargent Tom and his TONKA_TRUCK crew just have to show how beefy his toys can be and yank your wallet chains for said. Hell, everyone needs a Boeing 747 in their case right? cause' that's progress for those 1-2 gaming titles per years that give you 3-4 hours of enjoyable PC gaming.....
/off box
ChronoReverse - Tuesday, June 17, 2008 - link
The 4850 might actually hit that target...