Intel Core 2 Extreme QX9650 - Penryn Ticks Ahead
by Anand Lal Shimpi on October 29, 2007 12:13 AM EST - Posted in CPUs
Diving Deeper: SSE4 Performance
One of Penryn's real strengths is its support for SSE4, which can provide a significant performance advantage for some time to come. Unfortunately, as is usually the case with new instructions, it's going to take a while for applications to actually use them. For now, the only SSE4 benchmarks we're able to bring you come directly from Intel, but thankfully they reflect real-world usage models. We've shown you both tests in the past, during Intel's own sanctioned Penryn previews, and both involve some sort of encoding.
The most important test is a DivX encode using VirtualDub 1.7.6 and DivX 6.7. SSE4 comes in if you choose to enable a new full search algorithm for motion estimation, which is accelerated by two SSE4 instructions: MPSADBW and PHMINPOSUW. The idea is that motion estimation (figuring out what will happen in subsequent frames of video) requires computing a lot of sums of absolute differences, as well as finding the minimum values among the results of those computations. The SSE2 instruction PSADBW computes two sums of absolute differences from a pair of 16-byte operands (each holding sixteen unsigned 8-bit values); the SSE4 instruction MPSADBW computes eight.
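To make the two operations concrete, here is a minimal scalar sketch in Python of what the instruction pair computes. It is an illustration only, not DivX's actual search code: the function names (sad, eight_sads, min_pos) and the sample pixel values are made up for this example. eight_sads mirrors the shape of an MPSADBW result (eight 4-byte sums at consecutive offsets into an 11-byte window), and min_pos mirrors PHMINPOSUW (the smallest of eight values plus its position).

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel runs."""
    return sum(abs(x - y) for x, y in zip(a, b))


def eight_sads(window, block):
    """Eight 4-byte SADs at offsets 0..7 into an 11-byte window.

    This is the shape of result MPSADBW produces in one instruction;
    here it is emulated with plain scalar code.
    """
    if len(window) < 11 or len(block) != 4:
        raise ValueError("need an 11-byte window and a 4-byte block")
    return [sad(window[i:i + 4], block) for i in range(8)]


def min_pos(values):
    """Smallest value and its index (lowest index wins ties), as PHMINPOSUW reports."""
    best = min(values)
    return best, values.index(best)


# Hypothetical pixel data: the 4-byte block appears at offset 2 in the window.
window = [10, 12, 50, 52, 54, 56, 13, 11, 9, 8, 7]
block = [50, 52, 54, 56]

sads = eight_sads(window, block)
best, offset = min_pos(sads)
print(f"best SAD {best} at offset {offset}")
```

A full search repeats this over every candidate position in the search area, which is why doing eight comparisons per instruction and finding the minimum in one more adds up quickly.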
According to Intel's own research on motion estimation with SSE4, the same search algorithm can take 71 cycles per 16x16 pixel block using the SSE2 SAD (sum of absolute differences) instruction, compared to only 26 cycles using the SSE4 version. Fewer cycles per block translate directly into faster encoding.
We used VirtualDub 1.7.6 and DivX 6.7 with SSE4 Full Search enabled to measure the impact of this motion estimation optimization. Note that the motion estimation that's taking place here is more accurate than the default DivX setting, so both SSE4 and SSE2 versions of the algorithm result in slower performance (but better quality) than with it disabled.
CPU | SSE2 Search | SSE4 Search |
Intel Core 2 Extreme QX9650 (3.0GHz) | 21.9 seconds | 15.1 seconds |
Intel Core 2 Extreme QX6850 (3.0GHz) | 35.2 seconds | N/A |
On our QX9650, the full search with SSE4 enabled runs about 45% faster than with SSE2 only - impressive! Note also that the Penryn QX9650 offers better SSE2 performance in this test as well, coming in about 61% faster than the QX6850. The total performance increase from QX6850 SSE2 to QX9650 SSE4 in this test is an incredible 133%. Obviously, this is not going to be the norm in many other applications, but there's definitely some potential for meaningful optimizations in certain applications.
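The percentages above can be reproduced from the encode times in the table. The short Python sketch below does that, and then goes one step further under a loud assumption: it applies Amdahl's law (our choice of model, not something Intel or DivX stated) to Intel's 71 vs. 26 cycle figures to estimate how much of the SSE2 encode time is spent in the accelerated search. The variable and function names are ours.

```python
def pct_faster(slow_seconds, fast_seconds):
    """Percent speedup of the faster run relative to the slower one."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0


# Encode times in seconds, taken from the table above.
qx9650_sse2 = 21.9
qx9650_sse4 = 15.1
qx6850_sse2 = 35.2

sse4_gain = pct_faster(qx9650_sse2, qx9650_sse4)         # SSE4 vs. SSE2 on Penryn
penryn_sse2_gain = pct_faster(qx6850_sse2, qx9650_sse2)  # Penryn vs. Kentsfield, SSE2
total_gain = pct_faster(qx6850_sse2, qx9650_sse4)        # both effects combined

# Assumption: Amdahl's law, with Intel's per-block cycle counts (71 vs. 26)
# standing in for the speedup of the search kernel in this particular encode.
kernel_speedup = 71 / 26
app_speedup = qx9650_sse2 / qx9650_sse4
search_fraction = (1 - 1 / app_speedup) / (1 - 1 / kernel_speedup)

print(f"SSE4 over SSE2 on QX9650: {sse4_gain:.0f}%")
print(f"QX9650 over QX6850, SSE2: {penryn_sse2_gain:.0f}%")
print(f"QX6850 SSE2 to QX9650 SSE4: {total_gain:.0f}%")
print(f"Estimated share of SSE2 run spent in search: {search_fraction:.0%}")
```

Under that assumption, roughly half of the SSE2 run would be spent in the search itself, which is consistent with a 2.7x faster kernel producing a 45% faster encode. Treat it as a back-of-the-envelope estimate rather than a measurement.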
It's important to note that the PHMINPOSUW instruction doesn't appear to be in AMD's proposed SSE5 specification, although MPSADBW looks like it'll make it. From what we've heard, AMD will eventually add full SSE4 support to its processors, but not until the 2009/2010 time frame.
Our second benchmark from Intel is an MPEG-2 encode of an HD video using TMPGEnc 4.0.
CPU | TMPGEnc 4.0 |
Intel Core 2 Extreme QX9650 (3.0GHz) | 103 seconds |
Intel Core 2 Extreme QX6850 (3.0GHz) | 135 seconds |
The performance difference is a little less significant here, with the SSE4-less QX6850 taking about 31% more time to encode the input file than the QX9650.
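Because "31% more time" and "31% less time" are easy to confuse, here is a small Python check of the TMPGEnc numbers from the table. It shows the same gap expressed both ways; the variable names are ours.

```python
# Encode times in seconds, taken from the TMPGEnc table above.
qx9650_seconds = 103
qx6850_seconds = 135

# How much longer the QX6850 takes, relative to the QX9650.
extra_time_pct = (qx6850_seconds / qx9650_seconds - 1) * 100

# How much time the QX9650 saves, relative to the QX6850.
time_saved_pct = (1 - qx9650_seconds / qx6850_seconds) * 100

print(f"QX6850 takes {extra_time_pct:.0f}% more time")
print(f"QX9650 takes {time_saved_pct:.0f}% less time")
```

The two figures describe the same 32-second gap; they differ only in which chip's time is used as the baseline.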
Both of these are very real-world implementations of SSE4; unfortunately, it's tough to say how long it will be before we see widespread use of the new instructions.
16 Comments
Canadian87 - Monday, October 29, 2007
I'd like to point out that someone must have been tired when writing this. The graphs here on page 4 say "QX6950" VS "QX6850", a simple reversal of the numbers, but I'd like to correct it for those that might be confused; it took me a moment to figure out which was which myself. The "QX6950" is meant to be the "QX9650", and obviously the "QX6850" is the correct naming. GL HF.
GlassHouse69 - Monday, October 29, 2007
ew.intel again ftw. blech. They made a great chip. power usage is fantastic. One could get even lower total wattages (by far) if they concentrated on doing so. a quad core that can be cooled near silently. neat :)
sprockkets - Monday, October 29, 2007
Just a question, what was the difference from Core to Core 2? All I could ever find was cache size was increased. Now that I'm thinking about it, why not the name Quadro? Oh, nVidia ownz it.
defter - Monday, October 29, 2007
Core Duo (Yonah) was based on Pentium M. Core2 (Conroe) is a new architecture.
sprockkets - Monday, October 29, 2007
actually i found a comparison page about it, and core 2 isn't that much different from core. Yes, it updated a lot and gave improved performance. No, it is not a completely new architecture from PM, but you can say a big difference from the P4. http://www.anandtech.com/showdoc.aspx?i=2808&p...
sprockkets - Monday, October 29, 2007
On page 9 I believe you are grabbing some old benchmarks, old in the sense of your previous articles. I believe I pointed this out to you as a mistake, and now it is here in the bar graph. Again, how is it that the 2.33ghz C2D outperforms the 3ghz one?