NVIDIA's Fermi: Architected for Tesla, 3 Billion Transistors in 2010
by Anand Lal Shimpi on September 30, 2009 12:00 AM EST. Posted in: GPUs
A More Efficient Architecture
GPUs, like CPUs, work on streams of instructions called threads. While high end CPUs work on as many as 8 complicated threads at a time, GPUs handle many more threads in parallel.
The table below shows just how many threads each generation of NVIDIA GPU can have in flight at the same time:
GPU | Fermi | GT200 | G80 |
Max Threads in Flight | 24576 | 30720 | 12288 |
Fermi can't actually support as many threads in parallel as GT200. NVIDIA found that the majority of compute cases were bound by shared memory size, not thread count in GT200. Thus thread count went down, and shared memory size went up in Fermi.
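The totals in the table can be reproduced as SM count times resident threads per SM. This is a minimal sketch: the SM counts and per-SM thread limits below are assumptions taken from NVIDIA's published specifications for each chip (they are not stated in this section), and the names `GPUS` and `max_threads_in_flight` are purely illustrative.

```python
# Assumed figures from NVIDIA's public specs, not from the text above:
# (number of SMs, maximum resident threads per SM)
GPUS = {
    "Fermi": (16, 1536),
    "GT200": (30, 1024),
    "G80": (16, 768),
}


def max_threads_in_flight(name: str) -> int:
    """Total threads the whole chip can keep resident at once."""
    sm_count, threads_per_sm = GPUS[name]
    return sm_count * threads_per_sm


for name in GPUS:
    print(f"{name}: {max_threads_in_flight(name)} threads in flight")
```

Under these assumptions, Fermi holds more threads per SM than GT200 but has far fewer SMs, which is why its chip-wide total is lower.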
NVIDIA groups 32 threads into a unit called a warp (taken from the weaving term warp, referring to a group of parallel threads). In GT200 and G80, half of a warp was issued to an SM every clock cycle. In other words, it took two clocks to issue a full 32 threads to a single SM.
In previous architectures, the SM dispatch logic was closely coupled to the execution hardware. If you sent threads to the SFU, the entire SM couldn't issue new instructions until those instructions were done executing. If the only execution units in use were in your SFUs, the vast majority of your SM in GT200/G80 went unused. That's terrible for efficiency.
Fermi fixes this. There are two independent dispatch units at the front end of each SM in Fermi. These units are completely decoupled from the rest of the SM. Each dispatch unit can select and issue half of a warp every clock cycle. The threads can be from different warps in order to optimize the chance of finding independent operations.
There's a full crossbar between the dispatch units and the execution hardware in the SM. Each unit can dispatch threads to any group of units within the SM (with some limitations).
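The issue rates described above can be put into a rough model. This sketch assumes each dispatch unit works through its own warp at one half-warp per clock and ignores any contention for execution units; `clocks_to_issue` is a hypothetical helper, not anything NVIDIA has published.

```python
WARP_SIZE = 32          # threads per warp, per the article
THREADS_PER_ISSUE = 16  # half a warp issued per dispatch unit per clock


def clocks_to_issue(warps: int, dispatch_units: int) -> int:
    """Clocks needed to issue `warps` full warps.

    Assumption: each dispatch unit handles one warp at a time and
    needs WARP_SIZE / THREADS_PER_ISSUE clocks to issue all of it.
    """
    clocks_per_warp = WARP_SIZE // THREADS_PER_ISSUE
    rounds = -(-warps // dispatch_units)  # ceiling division
    return rounds * clocks_per_warp


# One issue path (GT200/G80 style) versus two dispatch units (Fermi style)
print(clocks_to_issue(warps=2, dispatch_units=1))
print(clocks_to_issue(warps=2, dispatch_units=2))
```

In this model a single warp still takes two clocks to issue on either design; the gain from the second dispatch unit is that a second, independent warp is issued over the same two clocks.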
The one inflexibility in NVIDIA's threading architecture is that every thread in a warp must be executing the same instruction at the same time. If they are, then you get full utilization of your resources. If they aren't, then some units go idle.
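To make the cost of that restriction concrete, here is a toy utilization model. It assumes that when threads in a warp take different code paths, the hardware runs each distinct path as a separate pass of equal length with only the matching threads active. That serialization model is a simplification introduced here for illustration; the text above only says that some units go idle. The function name is hypothetical.

```python
from collections import Counter

WARP_SIZE = 32  # threads per warp, per the article


def warp_utilization(paths):
    """Fraction of lane-slots doing useful work for one warp.

    `paths` holds one label per thread naming the code path it takes.
    Assumption: each distinct path costs one equal-length pass, during
    which only the threads on that path are active.
    """
    if len(paths) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} threads, got {len(paths)}")
    active_per_pass = Counter(paths)            # path -> active thread count
    useful = sum(active_per_pass.values())      # always WARP_SIZE
    issued = WARP_SIZE * len(active_per_pass)   # lanes reserved across passes
    return useful / issued


uniform = ["a"] * 32               # every thread runs the same instruction
split = ["a"] * 16 + ["b"] * 16    # the warp splits evenly across two paths
print(warp_utilization(uniform))   # 1.0
print(warp_utilization(split))     # 0.5
```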
A single SM can execute:
Fermi | FP32 | FP64 | INT | SFU | LD/ST |
Ops per clock | 32 | 16 | 32 | 4 | 16 |
If you're executing FP64 instructions the entire SM can only run at 16 ops per clock. You can't dual issue FP64 and SFU operations.
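As a back-of-the-envelope reading of the table, dividing the warp size by each unit's ops per clock gives the number of clocks an SM needs to push one full warp through that unit. This sketch covers execution throughput only and ignores latency, dual issue, and dispatch time; the names are illustrative.

```python
WARP_SIZE = 32  # threads per warp, per the article

# Ops per clock for a single Fermi SM, copied from the table above
OPS_PER_CLOCK = {"FP32": 32, "FP64": 16, "INT": 32, "SFU": 4, "LD/ST": 16}


def clocks_per_warp(unit: str) -> int:
    """Clocks for one SM to execute a single instruction across a full warp."""
    return WARP_SIZE // OPS_PER_CLOCK[unit]


for unit in OPS_PER_CLOCK:
    print(f"{unit}: {clocks_per_warp(unit)} clock(s) per warp")
```

By this estimate, FP32 and INT instructions retire a warp per clock, FP64 and LD/ST take two clocks, and an SFU instruction takes eight.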
The good news is that the SFU doesn't tie up the entire SM anymore. One dispatch unit can send 16 threads to the array of cores, while another can send 16 threads to the SFU. After two clocks, the dispatchers are free to send another pair of half-warps out again. As I mentioned before, in GT200/G80 the entire SM was tied up for a full 8 cycles after an SFU issue.
The flexibility is nice, or rather, the inflexibility of GT200/G80 was horrible for efficiency and Fermi fixes that.
415 Comments
Kougar - Friday, October 2, 2009 - link
Hey Anand:
Just wanted to say thanks for the article. Love the quotes and behind-the-scenes views, and in general the ever so informative articles like this that just can't be found elsewhere. So, thank you!
bobvodka - Friday, October 2, 2009 - link
Someone earlier asked if supporting doubles was going to waste silicon; I don't think it will.
If you look at the throughput numbers and the fact that FP64 is half that of FP32 with the SFU disabled, I suspect what is going on is that the FP64 calculations are being done by 2 cores at once, with the SFU being involved in some way (given how it is decoupled from the cores, there is no apparent good reason why the SFU should be disabled during FP64 operation).
A comment was also made re: ECC memory.
I suspect this won't make it to the consumer board; there is no good reason to do so and it would just cost silicon and power for a feature users don't need.
Zool - Friday, October 2, 2009 - link
Maybe the consumer board won't have ECC, but it will still be in the silicon (disabled). I don't think that they will produce two different silicons just because of ECC.
bobvodka - Friday, October 2, 2009 - link
Hmmm, you are probably right on that score, and that might aid yield if they can turn it off, as any faults in the ECC areas could be safely ignored.
Chances of them using ECC RAM on the boards themselves I would have said was zero, simply due to cost :)
halcyon - Friday, October 2, 2009 - link
Same foundry, same process, many more transistors....
Based on roughly extrapolating scaling from the RV870, how much bigger a power draw would this baby have?
The dollar draw from my wallet is going to be really powerful, that's for sure, but how about power?
deeper - Friday, October 2, 2009 - link
Well, not only is the GT300 months away, but it looks like the card they showed off is a fake anyhoo; check it out at Charlie Demerjian's www.semiaccurate.com
Zool - Friday, October 2, 2009 - link
Could you pls delete the majority of SiliconDoc's replies, and then this one after them? It's embarrassing to read them.
Pirks - Friday, October 2, 2009 - link
I call BS. How many people have 2560x1600 30-inchers? Two? Three? Main point - resolutions are _VERY_ far from being stagnated, they have SOOOOOOOOO _MUCH_ room for growth until 2560x1600 which right now covers maybe 1% of the PC gaming market. 90% of PC gamers still use low-res 1680x1050 if not less (I for one have 1400x1050, yeah shame on me, I don't want to spend $800 on hi-end SLI setup just to play Crysis in all its hi-res beauty, for.get.it.)
Shame Anand, real shame.
Otherwise top notch quality stuff, as always with Anand.
bigboxes - Friday, October 2, 2009 - link
1680x1050 = low res??? Seriously? That's hi-def bro. I understand you can do better, but for my 20" widescreen it is definitely hi-def.
JarredWalton - Friday, October 2, 2009 - link
I believe what you describe is exactly what is meant by stagnation. From Merriam-Webster: "To become stagnant." Stagnant: "Not advancing or developing." So yeah, I'd say that pretty much sums up display resolutions: they're not advancing.
Is that bad? Not necessarily, especially when we have so many applications that do things based purely on the wonderful pixel instead of on points or DPI. I use a 30" LCD, and I love the extra resolution for working with images, but the text by default tends to be too small. I have to zoom to 150% in a lot of apps (including Firefox/IE) to get what I consider comfortably readable text. I would say that 2560x1600 on a 30" LCD is about as much as I see myself needing for a good, looooong time.