ATI's Late Response to G70 - Radeon X1800, X1600 and X1300
by Derek Wilson on October 5, 2005 11:05 AM EST- Posted in
- GPUs
Pipeline Layout and Details
The general layout of the pipeline is very familiar. We have some number of vertex pipelines feeding through a setup engine into a number of pixel pipelines. After fragment processing, data is sent to the back end for things like fog, alpha blending and Z compares. The hardware can easily be scaled down at multiple points; vertex pipes, pixel pipes, Z compare units, texture units, and the like can all be scaled independently. Here's an overview of the high end case.
The maximum number of vertex pipelines in the X1000 series that it can handle is 8. Mid-range and budget parts incorporate 5 and 2 vertex units respectively. Each vertex pipeline is capable of one scalar and one vector operation per clock cycle. The hardware can support 1024 instruction shader programs, but much more can be done in those instructions with flow control for looping and branching.
After leaving the vertex pipelines and geometry setup hardware, the data makes its way to the "ultra threading" dispatch processor. This block of hardware is responsible for keeping the pixel pipelines fed and managing which threads are active and running at any given time. Since graphics architectures are inherently very parallel, quite a bit of scheduling work within a single thread can easily be done by the compiler. But as shader code is actually running, some instruction may need to wait on data from a texture fetch that hasn't completed or a branch whose outcome is yet to be determined. In these cases, rather than spin the clocks without doing any work, ATI can run the next set of instructions from another "thread" of data.
Threads are made up of 16 pixels each and up to 512 can be managed at one time (128 in mid-range and budget hardware). These threads aren't exactly like traditional CPU threads, as programmers do not have to create each one specifically. With graphics data, even with only one shader program running, a screen is automatically divided into many "threads" running the same program. When managing multiple threads, rather than requiring a context switch to process a different set of instructions running on different pixels, the GPU can keep multiple contexts open at the same time. In order to manage having any viable number of registers available to any of 512 threads, the hardware needs to manage a huge internal register file. But keeping as many threads, pixels, and instructions in flight at a time is key in managing and effectively hiding latency.
NVIDIA doesn't explicitly talk about hardware analogous to ATI's "ultra threading dispatch processor", but they must certainly have something to manage active pixels as well. We know from our previous NVIDIA coverage that they are able to keep hundreds of pixels in flight at a time in order to hide latency. It would not be possible or practical to give the driver complete control of scheduling and dispatching pixels as too much time would be wasted deciding what to do next.
We won't be able to answer specifically the question of which hardware is better at hiding latency. The hardware is so different and instructions will end up running through alternate paths on NVIDIA and ATI hardware. Scheduling quads, pixels, and instructions is one of the most important tasks that a GPU can do. Latency can be very high for some data and there is no excuse to let the vast parallelism of the hardware and dataset to go to waste without using it for hiding that latency. Unfortunately, there is just no test that we have currently to determine which hardware's method of scheduling is more efficient. All we can really do for now is look at the final performance offered in games to see which design appears "better".
One thing that we do know is that ATI is able to keep loop granularity smaller with their 16 pixel threads. Dynamic branching is dependant on the ability to do different things on different pixels. The efficiency of an algorithm breaks down if hardware requires that too many pixels follow the same path through a program. At the same time, the hardware gets more complicated (or performance breaks down) if every pixel were to be treated completely independently.
On NVIDIA hardware, programmers need to be careful to make sure that shader programs are designed to allow for about a thousand pixels at a time to take the same path through a shader. Performance is reduced if different directions through a branch need to be taken in small blocks of pixels. With ATI, every block of 16 pixels can take a different path through a shader. On G70 based hardware, blocks of a few hundred pixels should optimally take the same path. NV4x hardware requires larger blocks still - nearer to 900 in size. This tighter granularity possible on ATI hardware gives developers more freedom in how they design their shaders and take advantage of dynamic branching and flow control. Designing shaders to handle 32x32 blocks of pixels is more difficult than only needing to worry about 4x4 blocks of pixels.
After the code is finally scheduled and dispatched, we come to the pixel shader pipeline. ATI tightly groups pixel shaders in quads and is calling each block of pixel pipes a quad pixel shader core. This language indicates the tight grouping of quads that we already assumed existed on previous hardware.
Each pixel pipe in a quad is able to handle 6 instructions per clock. This is basically the same as R4xx hardware except that ATI is now able to accommodate dynamic branching on their dedicated branch hardware. The 2 scalar, 2 vector, 1 texture per clock arrangement seems to have worked with ATI in the past enough for them to stick with it again, only adding 1 branch operation that can be issued in parallel with these 5 other instructions.
Of course, branches won't happen nearly as often as math and texture operations, so this hardware will likely be idle most of the time. In any case, having separate hardware for branching that can work in parallel with the rest of the pipeline does make relatively tight loops more efficient than what they could be if no other work could be done while a branch was being handled.
All in all, one of the more interesting things about the hardware is its modularity. ATI has been very careful to make each block of the chip independent of the rest. With high end hardware, as much of everything is packed in as possible, but with their mid-range solution, they are much more frugal. The X1600 line will incorporate 3 quads with 12 pixel pipes alongside only 4 texture units and 8 Z compare units. Contrast this to the X1300 and its 4 pixel pipes, 4 texture units and 4 Z compare units and the "16 of everything" X1800 and we can see that the architecture is quite flexible on every level.
The general layout of the pipeline is very familiar. We have some number of vertex pipelines feeding through a setup engine into a number of pixel pipelines. After fragment processing, data is sent to the back end for things like fog, alpha blending and Z compares. The hardware can easily be scaled down at multiple points; vertex pipes, pixel pipes, Z compare units, texture units, and the like can all be scaled independently. Here's an overview of the high end case.
The maximum number of vertex pipelines in the X1000 series that it can handle is 8. Mid-range and budget parts incorporate 5 and 2 vertex units respectively. Each vertex pipeline is capable of one scalar and one vector operation per clock cycle. The hardware can support 1024 instruction shader programs, but much more can be done in those instructions with flow control for looping and branching.
After leaving the vertex pipelines and geometry setup hardware, the data makes its way to the "ultra threading" dispatch processor. This block of hardware is responsible for keeping the pixel pipelines fed and managing which threads are active and running at any given time. Since graphics architectures are inherently very parallel, quite a bit of scheduling work within a single thread can easily be done by the compiler. But as shader code is actually running, some instruction may need to wait on data from a texture fetch that hasn't completed or a branch whose outcome is yet to be determined. In these cases, rather than spin the clocks without doing any work, ATI can run the next set of instructions from another "thread" of data.
Threads are made up of 16 pixels each and up to 512 can be managed at one time (128 in mid-range and budget hardware). These threads aren't exactly like traditional CPU threads, as programmers do not have to create each one specifically. With graphics data, even with only one shader program running, a screen is automatically divided into many "threads" running the same program. When managing multiple threads, rather than requiring a context switch to process a different set of instructions running on different pixels, the GPU can keep multiple contexts open at the same time. In order to manage having any viable number of registers available to any of 512 threads, the hardware needs to manage a huge internal register file. But keeping as many threads, pixels, and instructions in flight at a time is key in managing and effectively hiding latency.
NVIDIA doesn't explicitly talk about hardware analogous to ATI's "ultra threading dispatch processor", but they must certainly have something to manage active pixels as well. We know from our previous NVIDIA coverage that they are able to keep hundreds of pixels in flight at a time in order to hide latency. It would not be possible or practical to give the driver complete control of scheduling and dispatching pixels as too much time would be wasted deciding what to do next.
We won't be able to answer specifically the question of which hardware is better at hiding latency. The hardware is so different and instructions will end up running through alternate paths on NVIDIA and ATI hardware. Scheduling quads, pixels, and instructions is one of the most important tasks that a GPU can do. Latency can be very high for some data and there is no excuse to let the vast parallelism of the hardware and dataset to go to waste without using it for hiding that latency. Unfortunately, there is just no test that we have currently to determine which hardware's method of scheduling is more efficient. All we can really do for now is look at the final performance offered in games to see which design appears "better".
One thing that we do know is that ATI is able to keep loop granularity smaller with their 16 pixel threads. Dynamic branching is dependant on the ability to do different things on different pixels. The efficiency of an algorithm breaks down if hardware requires that too many pixels follow the same path through a program. At the same time, the hardware gets more complicated (or performance breaks down) if every pixel were to be treated completely independently.
On NVIDIA hardware, programmers need to be careful to make sure that shader programs are designed to allow for about a thousand pixels at a time to take the same path through a shader. Performance is reduced if different directions through a branch need to be taken in small blocks of pixels. With ATI, every block of 16 pixels can take a different path through a shader. On G70 based hardware, blocks of a few hundred pixels should optimally take the same path. NV4x hardware requires larger blocks still - nearer to 900 in size. This tighter granularity possible on ATI hardware gives developers more freedom in how they design their shaders and take advantage of dynamic branching and flow control. Designing shaders to handle 32x32 blocks of pixels is more difficult than only needing to worry about 4x4 blocks of pixels.
After the code is finally scheduled and dispatched, we come to the pixel shader pipeline. ATI tightly groups pixel shaders in quads and is calling each block of pixel pipes a quad pixel shader core. This language indicates the tight grouping of quads that we already assumed existed on previous hardware.
Each pixel pipe in a quad is able to handle 6 instructions per clock. This is basically the same as R4xx hardware except that ATI is now able to accommodate dynamic branching on their dedicated branch hardware. The 2 scalar, 2 vector, 1 texture per clock arrangement seems to have worked with ATI in the past enough for them to stick with it again, only adding 1 branch operation that can be issued in parallel with these 5 other instructions.
Of course, branches won't happen nearly as often as math and texture operations, so this hardware will likely be idle most of the time. In any case, having separate hardware for branching that can work in parallel with the rest of the pipeline does make relatively tight loops more efficient than what they could be if no other work could be done while a branch was being handled.
All in all, one of the more interesting things about the hardware is its modularity. ATI has been very careful to make each block of the chip independent of the rest. With high end hardware, as much of everything is packed in as possible, but with their mid-range solution, they are much more frugal. The X1600 line will incorporate 3 quads with 12 pixel pipes alongside only 4 texture units and 8 Z compare units. Contrast this to the X1300 and its 4 pixel pipes, 4 texture units and 4 Z compare units and the "16 of everything" X1800 and we can see that the architecture is quite flexible on every level.
103 Comments
View All Comments
HamburgerBoy - Wednesday, October 5, 2005 - link
Seems kind of odd that you'd include nVidia's best but not ATi's.cryptonomicon - Wednesday, October 5, 2005 - link
I was expecting ATI to make a comback here, but the performance is absolutely abysmal in most games. I dont know what else to say except this product is just gonna be sitting in shelves unless the price is cut severely.bob661 - Thursday, October 6, 2005 - link
LOL! I wouldn't say abysmal. Abysmal would be the X1800XT performing like a 6600GT. The card that doesn't do well is the X1600. X1800's are fantastic performers and certainly much better than my 6600GT at displaying all of a games glory. It just wasn't the ass kicker most everyone hyped it up to be. But technically speaking, it IS an ass kicker.flexy - Wednesday, October 5, 2005 - link
i am a bit disappointed - while at work i overflew the other reviews and then, as the crowning end of my day i read the AT review.I (and probably many others) were waiting for this card like it's the best think since sliced bread - and now, WAY too late we do *indeed* have a good card - but a card which is a contender to NV's offerings and nothing groundbreaking.
Don't get me wrong - better AF/AA is something i always have a big eye on, but then ATI always had this slight edge when it came to AF/AA.
The pure performance in FPS itself is rather sobering - just what we're used to the last few years...usually we have TWO high-end cards out which are PRETT MUCH comparable - and no card is really the "sliced bread" thing which shadows all others.
This is kind of sad.
The price also plays a HUGE factor - and amongst the nice AA/AF features i have a hard time to legitimate say spending $500 for "this edge"...especially as someone who already owns a X850XT .
Not as long i am still playable in HL2/DOD/Lost Coats etc....i dont think i will see FPS fall *that quick* - in other words: I can "afford" to wait longer (R580 ?) and wait for appropriate Game engines (UT2K4 ??) which would make it necessary for me to ditch my X850XT because the X850 got "slow".
D3/OpenGL performance is still disappointing - but then i dont know what NV-specific code D3 uses - but still sad to see this card getting it in the face even if it now has SM3.0 and everything.
Availability:
Well..here we go again....
Bottomline: If i were rich and the card would be orderable RIGHT NOW i would get the XT - no question.
But since i am not rich and the card is *a bit* a disappointment and obviously NOT EVEN AVAILABLE - i will NOT get this card.
It's time to sit back, relax, enjoy my current hardware, watch the prices fall, watch the drivers get better...and then, maybe, one day get one of those or A R580 :)
I WISHED it would NOT have been a day making one "sit and relax" but instead burst out in joy and enthusiasm....but well, then this is real life :)
Wesleyrpg - Thursday, October 6, 2005 - link
summarised very well mate!Regs - Wednesday, October 5, 2005 - link
They were likely better off trying to market that we didn't need new video cards this year and save their capital for next year. These performance charts, especially the "mid range" parts are awfully embarrassing to their company.photoguy99 - Wednesday, October 5, 2005 - link
I assume it was not one of the cards that come overclocked stock to 490Mhz?It seems like it would be fair to use a 490Mhz NVidia part since manufacturers are selling them at that speed out of the box with full warrenty intact.
Evan Lieb - Wednesday, October 5, 2005 - link
"Unless you want image quality."There is no image quality difference, and I doubt you've used either card. Fact of the matter is that you'll never notice IQ differences in the vast majority of the games today. Hell, it's even hard to notice differences in slower paced games like Splinter Cell. The reality is that speed is and always will be the number one priority, because eye candy doesn't matter if you're bogged down by choppy frame rate.
Right now, there is zero reason to want to purchase these cards, if you can even find them. That's fact. Accept it and move on until something else is released.
Madellga - Thursday, October 6, 2005 - link
Quality includes also playing a game without shimmering. I can't get that on my 7800GTX.Before anyone replies, the 78.03 drivers improve a lot the problem but does not fix it.
The explanation is inside Derek's article:
"Starting with Area Anisotropic (or high quality AF as it is called in the driver), ATI has finally brought viewing angle independent anisotropic filtering to their hardware. NVIDIA introduced this feature back in the GeForce FX days, but everyone was so caught up in the FX series' abysmal performance that not many paid attention to the fact that the FX series had better quality anisotropic filtering than anything from ATI. Yes, the performance impact was larger, but NVIDIA hardware was differentiating the Euclidean distance calculation sqrt(x^2 + y^2 + z^2) in its anisotropic filtering algorithm. Current methods (NVIDIA stopped doing the quality way) simply differentiate an approximated distance in the form of (ax + by + cz). Math buffs will realize that the differential for this approximated distance simply involves constants while the partials for Euclidean distance are less trivial. Calculating a square root is a complex task, even in hardware, which explains the lower performance of the "quality AF" equation.
Angle dependant anisotropic methods produce fine results in games with flat floors and walls, as these textures are aligned on axes that are correctly filtered. Games that allow a broader freedom of motion (such as flying/space games or top down view games like the sims) don't benefit any more from anisotropic filtering than trilinear filtering. Rotating a surface with angle dependant anisotropic filtering applied can cause noticeable and distracting flicker or texture aliasing. Thus, angle independent techniques (such as ATI's area aniso) are welcome additions to the playing field. As NVIDIA previously employed a high quality anisotropic algorithm, we hope that the introduction of this anisotropic algorithm from ATI will prompt NVIDIA to include such a feature in future hardware as well. "
Phantronius - Wednesday, October 5, 2005 - link
Unless you a fanboy