Despite being extremely well prepared, with Nehalem, motherboards, coolers and memory in hand well before launch, the run-up to the NDA lift of Intel's Core i7 processors was stressful. There was so much to test: multi-GPU compatibility with X58, memory controller performance, general application performance, overclocking, Hyper Threading, etc.
We're all still hard at work sorting out the details. Gary is working on an X58 motherboard roundup and has been testing 12GB memory configurations for the past several days (as well as working with board vendors to improve performance/compatibility with 12GB, but I'll let him tell you about that), Derek is working on multi-GPU performance, and Kris has been working on an overclocking guide. What have I been up to? Well, I've been trying to answer a few lingering questions about Nehalem.
What I've got today are the first answers to some of those questions. I've spent the past week looking at power efficiency and memory latency, and talking to some of Intel's finest on the phone about Nehalem. And I'm back to report; gather 'round for Nehalem: The Unwritten Chapters.
The Uncore
I got a little more detail from Intel on the uncore clock. Just like Phenom, Intel’s Core i7 is divided into an area called the “core” and an area called the “uncore”. The core contains the individual processor cores and their L1/L2 caches, while the uncore houses the memory controller and the shared L3 cache. In our review I mentioned that the uncore runs at 2.66GHz, which is true, but only for the Core i7-965. The Core i7-940 and 920 both run the uncore at 2.13GHz.
The uncore clock is defined by Intel just like the core clock is - Intel sets it based on yield and performance targets. As I mentioned in the launch review, the uncore clock runs at a simple multiplier of the bclk (133MHz): 20x for the i7-965 and 16x for the i7-940/920. The uncore also runs at its own voltage (1.20V) and that voltage doesn't scale up/down.
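The math is simple enough to sanity check; here's a quick sketch (it just multiplies the nominal 133MHz bclk by the 20x/16x multipliers quoted above, so the figures are approximate):

```python
# Quick sketch: uncore clock = bclk x uncore multiplier.
# Uses the nominal 133MHz bclk and the 20x/16x multipliers quoted above;
# the real bclk is closer to 133.33MHz, so the results are approximate.

BCLK_MHZ = 133.0

uncore_multipliers = {
    "Core i7-965": 20,
    "Core i7-940": 16,
    "Core i7-920": 16,
}

for sku, mult in uncore_multipliers.items():
    uncore_ghz = BCLK_MHZ * mult / 1000.0
    print(f"{sku}: {mult} x {BCLK_MHZ:.0f}MHz = {uncore_ghz:.2f}GHz uncore")
```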
On Intel’s own X58 board the uncore clock is configured on the memory settings page and is simply called UCLK:
I took the i7-965, ran it at 2.66GHz to simulate an i7-920, and varied the uncore clock to measure the impact on L3 cache and main memory latency:
| Core Clock | Uncore Clock | L3 Latency | Main Memory Latency | x264 HD Benchmark | Cinebench XCPU Benchmark |
|------------|--------------|------------|---------------------|-------------------|--------------------------|
| 2.66GHz | 2.93GHz | 34 cycles | 143 cycles | 72.8 fps | 13456 |
| 2.66GHz | 2.66GHz | 36 cycles | 148 cycles | 73.0 fps | 13429 |
| 2.66GHz | 2.13GHz | 41 cycles | 159 cycles | 72.7 fps | 13182 |
At a 2.66GHz uncore clock things seem to hit a sweet spot, although the translation to real-world performance just isn't there. Perhaps in a very memory intensive test we'd see something more pronounced, but even the x264 HD encoding test showed no performance difference between the three uncore clock speeds.
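To put those cycle counts in wall-clock terms, here's a quick back-of-the-envelope sketch (it assumes the latency tool reports in core-clock cycles, so the fixed 2.66GHz core clock is the time base):

```python
# Quick sketch: convert the measured latencies from cycles to nanoseconds.
# Assumes the latency tool reports in core-clock cycles, so the fixed
# 2.66GHz core clock is the time base for all three rows in the table above.

CORE_CLOCK_GHZ = 2.66  # core clock in GHz = cycles per nanosecond

# (uncore clock, L3 latency in cycles, main memory latency in cycles)
rows = [
    ("2.93GHz", 34, 143),
    ("2.66GHz", 36, 148),
    ("2.13GHz", 41, 159),
]

for uncore, l3_cycles, mem_cycles in rows:
    l3_ns = l3_cycles / CORE_CLOCK_GHZ
    mem_ns = mem_cycles / CORE_CLOCK_GHZ
    print(f"Uncore {uncore}: L3 ~{l3_ns:.1f}ns, main memory ~{mem_ns:.1f}ns")
```

Under that assumption, dropping the uncore from 2.93GHz to 2.13GHz only adds about 6ns to main memory latency, which helps explain why the application benchmarks barely move.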
Surprisingly enough, I couldn’t get the i7-965’s uncore to hit 3.2GHz - Vista would bluescreen before I could even get to the desktop (note that the Intel X58 board I was using did not support adjusting the uncore voltage, so it remained at stock). As the table above shows, increases in uncore frequency aren't nearly as useful as increases in CPU frequency. Intel recognized this relationship as well and chose to optimize the uncore for power consumption rather than clock speed, which means the uncore won't be able to clock as high as the cores themselves. You could always raise the uncore voltage substantially to try to push its clock higher, but right now the tradeoff doesn't look worth it, since power consumption would climb considerably.
23 Comments
Denithor - Saturday, November 8, 2008 - link
HT works well on i7 because of two things: software is much more multithreaded today and there have been drastic throughput & memory controller improvements in the generations from Netbust to Nehalem. Multithreaded applications can be accelerated hugely by pulling resources from multiple cores to work on one application (whether physical or virtual cores doesn't matter).
HT on Netbust was like fitting a garden hose onto a fire hydrant. The data just backed up and couldn't feed through the pipe smoothly. On i7 the bandwidth and memory controller have been optimized to improve flow so the cores don't sit idle (HT basically levels the flow of work across the cores so they all stay busy).
TA152H - Saturday, November 8, 2008 - link
Actually, you're probably missing the point that Nehalem is a lot wider than the Pentium 4 was. Consequently, in any given clock cycle you have more execution resources than a single thread is likely to use, and an additional thread can put them to work.

Most of the time the data is read from the L1 cache or, at worst, the L2 cache, so memory throughput isn't going to be a huge problem most of the time. But, then again, the i7 has a bigger L1 cache, which probably helps as well. It's very slow though, and it makes you wonder why they shackled this processor with a very slow L1 cache (the same latency in cycles as a Pentium 4, but on a design with a much lower clock speed). I mean, it can't clock higher than Penryn, and the cache isn't any bigger, so does it need to be 33% slower? Power savings are nice, but not for a 33% slower L1 cache.
Also, I'm curious why Intel gave up on the Pentium 4 before 45nm production. If you think about it, the drastically lower power of this manufacturing technology would have yielded enormous improvements in clock speed (since the limitation wasn't transistor switching speed, but power/heat). I don't think there's any doubt they'd be running over 6GHz, and with some effective tweaks (and undoing some of Prescott's damage) it might be an interesting processor. Probably not though, but I'm a little curious how it would have panned out.
ltcommanderdata - Saturday, November 8, 2008 - link
Yes, I think HT fits well with Nehalem because of the increased execution resources: 3 ALUs, 2 FPUs, and 3 SSE units compared to 3 ALUs and 2 FPU/SSE units in Netburst. Although I think HT serves a different purpose in each design. Netburst didn't have as much memory bandwidth and its latency was higher, so HT served to hide that, while Nehalem has plenty of memory bandwidth and execution resources and HT serves to take best advantage of those resources.

In regards to the high cache latency, I have to agree. I have yet to see an explanation of where the high L1 cache latency comes from. And the L2 cache latency is similarly unimpressive, considering Dothan had a 2MB L2 cache per core with a 10 cycle latency while Nehalem's 256KB L2 cache per core has a higher latency at 11 cycles. Granted, perhaps having an L3 cache forces limitations on the other caches, but I still think the latencies are quite high. No offense to the Oregon team, but the last time they did a microarchitecture refresh, in Prescott, they increased the P4's L1 cache latency from 2 cycles in Northwood to 4 cycles and the L2 latency from 16 cycles to 23 cycles. So it's disconcerting that they've increased the L1 cache latency from 3 cycles in Penryn to the same 4 cycles in Nehalem, shrunk the L2 cache from 6MB to 256KB while only shaving 4 cycles off its latency (down to 11 cycles), and added a 39 cycle L3 cache. I don't think latencies will improve in Westmere, but hopefully they can double the L2 cache to 512KB without increasing latencies and similarly increase the L3 cache, probably to 12MB, without increasing latencies. And maybe latencies can improve in the next microarchitecture refresh in Sandy Bridge with the return of the Israeli team.
And I also agree that the P4 could probably still have had hope with the 45nm process. Even on the 65nm process, Presler still had potential. With the Pentium Extreme Edition 965, Intel had basically caught up with the power consumption of its competitor, the FX-60. And things actually improved over time: if you look at the original Presler B1 stepping, Intel was only able to reach 3GHz in the 930D at a 95W TDP, while by the last D0 stepping, released after Conroe, Presler was able to reach 3.6GHz in the 960D under the same 95W TDP. On the same process, a 20% increase in clock speed for the same power consumption is impressive for any microarchitecture, and especially Netburst.
Clearly, the 65nm process could have brought Netburst's power consumption under control, but by that time development focus had long since shifted to Merom, which is why Presler/Cedar Mill was only a shrink rather than a redesign of Prescott. I guess we'll never know what could have happened if Intel had actually used Presler to correct Prescott's flaws, such as reducing cache latency, adding a 2nd instruction decoder to keep the Trace Cache and execution units fed, introducing a native dual core design like Yonah over Dothan, etc. But I think the Merom strategy was better in the end, since even with a redesign to improve performance, Netburst's power consumption would probably always have been on the high end of acceptable, and it would never have been fit for mobile usage, which is where consumer focus is shifting.
IntelUser2000 - Saturday, November 8, 2008 - link
Don't complain about the lack of single-threaded performance increase. Where do you think the majority of the performance increase in Core 2 came from? It's not a new idea, it just has better memory parallelism (memory disambiguation, excellent prefetchers).

The future IS MULTI-THREADED. Single thread brings minimal performance increases. For gamers who care, the GPU does far more than the CPU, and multi-threading improves the things that really matter.
Westmere isn't gonna bring large L2 caches; the L3 cache will increase, but that's because the core count is going to 6 cores. Sandy Bridge will bring the per-core L2 cache up to 512KB, but how much do you think that'll do? At most 5-10%.
The number of ways to increase x86 CPU performance is shrinking. That's the reason Sandy Bridge will bring a more advanced Turbo Mode implementation for single-threaded performance.
ltcommanderdata - Sunday, November 9, 2008 - link
I wasn't aware that I was complaining about single-threaded performance in my previous posts.

And another important thing that Sandy Bridge is bringing is AVX. SIMD doesn't benefit all programs, but it does increase performance of optimized applications regardless of whether they are single-threaded or multi-threaded.
SiXiam - Saturday, November 8, 2008 - link
"The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V."- I just wanted to let everyone know that benchmarkreviews.com got the i7 920 at stock speeds with 1.125volts.
2.66GHz @ 1.125V (133MHz x 20)
http://benchmarkreviews.com/index.php?option=com_c...
Denithor - Friday, November 7, 2008 - link
Great article. Very impressive results here, congrats to the i7 design team. Of course, we all said the same thing when C2D was launched, with a much bigger differential in performance/watt versus the "Netbust" architecture.

Have you guys tried F@H SMP client on these i7 chips yet? I'm curious how they stack up against the Q9xx0 series in raw performance. Do the multithreading improvements help put CPU folding any closer to GPU folding or will GPU continue to reign supreme?
Does Intel intend to launch dual-core versions of these processors or will this generation be quad only?
Finally, for myself, I have an e8400 and an e3110 which are more than adequate for my current needs. I doubt I will even bother with one of these new setups, I'll just wait until Westmere and the 32nm improvements (higher clocks, lower power, heat and probably price).
Strid - Friday, November 7, 2008 - link
Yeah, I agree. While they offer solid quad-core performance, and possibly decent energy efficiency as well, they're not much use for a guy like me who doesn't use much of that multi-core jazz.

They might not chew up more watts than the QX9770, but the QX9770 is still a lot more power hungry than even the currently quickest 45nm dual core (E8600). Any news on a dual-core version of Nehalem yet? I'll stick to my Xeon E3110 until then.
tynopik - Friday, November 7, 2008 - link
> (I will be working on a Hyper Threading/multi-tasking set of tests next).

Looking forward to it!
(and then the VM tests ;)
cpugeek - Friday, November 7, 2008 - link
I think AnandTech failed to mention QPI vs. FSB. QPI is super power hungry and offsets a lot of the power reduction done by Intel. That's why Lynnfield/Clarksfield will be much more power efficient, since they don't use a QPI physical layer to talk to the chipset/Tylersburg.