Intel Q3'05 Roadmap: Conroe Appears, Speculation Ensues
by Kristopher Kubicki & Jarred Walton on August 8, 2005 3:13 AM EST - Posted in CPUs
In order to have any inkling of what Conroe will offer, we need to take a step back for a minute. The last truly new architecture that Intel introduced was the IA-64/EPIC platform for Itanium (although depending on how you look at it, some would say that NetBurst actually came after IA-64). Prior to that, Intel had the P6 architecture, which was preceded by P5 (Pentium), 486, 386, etc. all the way back to the first parts Intel made. At present there are three major architectures that are all in production at Intel: P6 (Pentium Pro/II/III now evolved to Pentium M), NetBurst (Pentium 4 and derivatives), and IA-64/EPIC used in Itanium processors. P6 isn't actually the real name of the architecture for Pentium M, of course - Intel has never come forth with an official name. While Pentium M does use an extension of P6, the Banias and Dothan cores really change things quite a bit. We'll talk about how in a moment, but we'll refer to the architecture as P6-M for the remainder of this article. When we say P6-M, we mean Banias, Dothan, and Yonah. Let's take a quick look at the benefits and problems of each architecture before we talk about Conroe.
Prescott is used on the recent Pentium and Celeron chips and has a 31 stage pipeline, coupled to a separate 8 stage fetch/decode front end. (Earlier Northwood and Willamette cores use a 20 stage pipeline with the 8 stage front end.) Together the total pipeline length comes in at 39 stages - over twice the length of the current AMD K8 pipeline. In fact, the next longest pipelines outside of Intel aren't even out yet: Cell and Xenon are both around 21 stages long. The benefits of a long pipeline are in raw clock speeds. It's no surprise that NetBurst is the only chip currently shipping in speeds greater than 3 GHz, and Cell and Xenon are slated to join that "elite" group of processors in the future.
While a lengthy pipeline allows for high clock speeds, it also introduces inefficiencies in cases where a branch prediction misses. When that occurs, everything following the branch instruction has to be cleared from the CPU pipeline and execution begins again - a penalty of as much as 30 cycles in the case of Prescott. (Of course, it could be even longer if there's a cache miss and main memory needs to be accessed, but that delay would occur with or without the branch miss so we'll ignore it.) In order to avoid the full penalty of a branch misprediction (39 cycles), Intel decoupled the fetch/decode unit from the main pipeline and turned the L1 cache into a "trace cache" where instructions are stored in decoded form. The trace cache is actually a very interesting concept and certainly helped improve performance. It basically allows many instructions to skip 1/4 to 1/3 of the standard pipeline. While Intel no longer holds the performance crown, it wasn't until the launch of the K8 that Intel really lost the lead.
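To put rough numbers on the misprediction cost described above, here is a minimal Python sketch. The 30 and 39 cycle penalties come from the text; the branch frequency and misprediction rate are purely illustrative assumptions, not measured Prescott figures, and the function name is our own.

```python
# Back-of-the-envelope model: average cycles lost per instruction
# to branch mispredictions. Not a simulation of any real core.

def mispredict_cost(branch_fraction, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction = branches/instr * miss rate * penalty."""
    return branch_fraction * mispredict_rate * penalty_cycles

# Assumptions for illustration only (not from the article):
BRANCH_FRACTION = 0.20   # one instruction in five is a branch
MISPREDICT_RATE = 0.05   # 5% of branches are mispredicted

# Penalties quoted in the text for Prescott:
TRACE_CACHE_PENALTY = 30  # decoded instructions come from the trace cache
FULL_PENALTY = 39         # 31-stage pipeline plus 8-stage front end

with_trace_cache = mispredict_cost(
    BRANCH_FRACTION, MISPREDICT_RATE, TRACE_CACHE_PENALTY)
without_trace_cache = mispredict_cost(
    BRANCH_FRACTION, MISPREDICT_RATE, FULL_PENALTY)

print(f"with trace cache:    {with_trace_cache:.2f} cycles/instruction lost")
print(f"without trace cache: {without_trace_cache:.2f} cycles/instruction lost")
```

Under these assumed rates, the gap between the two figures is what the decoupled front end buys on every mispredicted branch; a different mix of branches would scale both numbers but not their ratio.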
In terms of internal functioning of the NetBurst pipeline, each clock cycle at most three traces (instructions decoded into micro-ops) can be issued from the trace cache to the queues within the main pipeline. The NetBurst queues (schedulers) can then dispatch up to six micro-ops per cycle, but there are restrictions and in many cases there are execution slots that can't be filled on any given cycle. Based on the number of traces issued per clock, most would call NetBurst a three-wide issue design. That makes NetBurst the same as AMD's K7/K8 cores as well as the P6/P6-M cores in terms of issue rate. Purely from a theoretical standpoint, NetBurst could execute 3 instructions per clock, multiplied by the clock speed to give the final performance. Nothing ever reaches the theoretical performance of course - if it did, then NetBurst would still be over 35% faster than any other architecture, given its high clock speeds. Branch misses, cache misses, instruction dependencies, etc. all serve to reduce the theoretical performance offered.
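The "issue width multiplied by clock speed" arithmetic can be sketched as follows. The three-wide issue rate is from the text; the two clock speeds are assumptions chosen for illustration (3.8 GHz for the fastest NetBurst part and 2.8 GHz for the fastest competing three-wide chip), and the function name is hypothetical.

```python
# Theoretical peak throughput = instructions issued per clock * clock rate.
# Real code never sustains this; it is only an upper bound.

def peak_instructions_per_second(issue_width, clock_ghz):
    """Upper bound on instructions per second for a given issue width."""
    return issue_width * clock_ghz * 1e9

ISSUE_WIDTH = 3            # from the text: NetBurst, K7/K8 and P6 are 3-wide

# Assumed clock speeds, for illustration only:
NETBURST_CLOCK_GHZ = 3.8
RIVAL_CLOCK_GHZ = 2.8

netburst_peak = peak_instructions_per_second(ISSUE_WIDTH, NETBURST_CLOCK_GHZ)
rival_peak = peak_instructions_per_second(ISSUE_WIDTH, RIVAL_CLOCK_GHZ)

# With equal issue widths, the advantage reduces to the clock speed ratio.
advantage = netburst_peak / rival_peak - 1.0
print(f"theoretical NetBurst advantage: {advantage:.1%}")
```

Because the issue widths are identical, the result depends only on the clock ratio, which is why the theoretical lead evaporates once pipeline stalls are taken into account.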
Moving on to the Pentium M core, you can find out some of the details of what was changed in our Dothan investigation from last year. The basic idea is to take the P6 core and add some of the latest technologies to the design. To recap the earlier article, the Pentium M has several major design features. First, it goes with a more moderate pipeline length: longer than P6 to allow higher clock speeds, but shorter than NetBurst. (Intel isn't saying more than that, though guesstimates would put the length around 14 to 17 stages.) Next, Intel added micro-ops fusion to the core, which helps some instructions move through the core faster and avoids delays associated with out-of-order cores. Micro-ops fusion in essence eliminates dependency problems on certain instructions, since they are "fused" together. The core also has a dedicated stack manager that helps improve memory access efficiency as well as lower power use. Better branch prediction is another major improvement relative to the P6 design - take something like the branch prediction of NetBurst and put it on the P6 core and that's a rough description of what was done. Branch prediction is one of the features of an architecture that generally makes all code run a bit faster, and it once again reduces inefficiencies. The number of execution units remains the same as in P6, which means there's less wasted power on idle parts of the chip, while the faster system bus of NetBurst helps to keep the processor fed with data. Finally, power saving features were added to the cache, allowing the CPU to only fully power up small areas of the L2 cache for each cache access. The end result is a processor that has certain limitations but ends up achieving a very high performance per Watt rating, which is important for a mobile part. As we've shown in several articles, Pentium M makes for an attractive laptop processor but still can't compete with desktop parts in certain tasks.
Moving on to the final architecture, we come to IA-64/EPIC. While similar in some ways to VLIW (Very Long Instruction Word) architectures of the past, Intel worked to overcome some of the problems with VLIW (specifically the need to recompile code for every processor update) and called their new approach EPIC: "Explicitly Parallel Instruction Computing". In contrast to the P6, NetBurst, K7, and K8 architectures that can issue up to three instructions per cycle, the current Itanium 2 chips can issue six instructions per clock. From a purely theoretical standpoint, the fastest Itanium 2 running at 1.6 GHz actually has more computational power than any other Intel chip. Throw in dual core designs with HyperThreading - HyperThreading that actually works much better than NetBurst HTT due to the wide design of EPIC - and each chip not only has the potential to issue six instructions per clock, but it should actually come relatively close to that number. Another difference between Itanium and the other designs is that large amounts of cache are present in order to keep the pipelines fed with data. Current models ship with up to 9MB of L3 cache, while future parts like the Montecito will have 24MB of L3 cache (and a transistor count of 1.7 billion transistors - about eight times the transistor count of the Pentium D Smithfield core)!
Of course, with the wide issue rate of Itanium 2 (the original Itanium had a 6-wide core as well, but could generally only get 3.5 to 4.0 IPC at best), you need a lot of execution units. NetBurst has 7 execution units in Prescott: two simple integer units (which can function as 4 integer units if you count the double pumped design), a complex integer unit, two FP/SIMD units, and dedicated memory load and store units. If you want to count the simple integer units as 2 each, you could make a stretch and say NetBurst has nine execution units. AMD's K7 and K8 both have nine execution units as well, only they go for a less customized approach and instead have three each of the integer, FP/SIMD, and memory units. Each of AMD's units is fully functional, unlike the "simple" and "complex" integer units in NetBurst. In contrast to these architectures, the current Itanium 2 chips have six ALUs (Arithmetic Logic Units), three BRUs (Branch Units), two FPUs, one SIMD, two load units, and two store units - call it 16 functional units if you prefer, though the specialization of some of them makes it slightly less than that. While Itanium 2 is very wide, the length of the pipeline is only 8 stages - shorter than that of any modern x86 processor by a significant amount. That certainly plays a role in the reduced clock speeds, but like Athlon 64, lower clock speeds with a more efficient architecture can outperform long pipelines in many instances. In order to extract all of the potential performance from Itanium, however, a lot of work needs to be done during code compilation. This is the Achilles' heel of VLIW designs: processor updates require the code to be recompiled. While EPIC doesn't require that you recompile the code, newer compiler optimizations can improve performance significantly.
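For readers who want to check the unit counts, the tallies above can be written out as a short Python sketch. The per-unit numbers are taken straight from the paragraph; the dictionary layout and names are simply our own way of organizing them.

```python
# Execution unit counts as quoted in the text, grouped by architecture.
EXECUTION_UNITS = {
    "NetBurst (Prescott)": {
        "simple integer": 2,
        "complex integer": 1,
        "FP/SIMD": 2,
        "load": 1,
        "store": 1,
    },
    "K7/K8": {"integer": 3, "FP/SIMD": 3, "memory": 3},
    "Itanium 2": {
        "ALU": 6, "branch": 3, "FPU": 2, "SIMD": 1, "load": 2, "store": 2,
    },
}

def total_units(units):
    """Sum the unit counts for one architecture."""
    return sum(units.values())

totals = {name: total_units(units) for name, units in EXECUTION_UNITS.items()}

# Counting each double pumped simple integer unit twice adds one extra
# unit per simple integer unit to the NetBurst total.
netburst = EXECUTION_UNITS["NetBurst (Prescott)"]
netburst_double_pumped = totals["NetBurst (Prescott)"] + netburst["simple integer"]

for name, count in totals.items():
    print(f"{name}: {count} units")
print(f"NetBurst (double pumped counted twice): {netburst_double_pumped} units")
```

The raw totals say nothing about how often the units can actually be kept busy, which is the point of the compiler discussion above.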
All that talk about other Intel architectures (as well as some of AMD's), and yet we still haven't said exactly what Conroe is. The simple truth is that no one other than Intel and people under strict NDA really know for sure what the Conroe architecture will entail. There is a point to all of this discussion of previous architectures, though. While we've really only skimmed the surface of the designs, hopefully you can see how wildly different each architecture is from the others. NetBurst is long and narrow, EPIC is short and wide, and P6-M is a medium length pipeline that is narrower than either of the others but requires less power. The high clock speeds and resultant power levels have created problems for NetBurst, but there are still cases where it substantially outperforms P6-M. Itanium is still a better solution for certain types of big business work (databases in particular) than any of the other Intel architectures. While all three architectures have their strong points, none of them qualify as a universally superior solution. With Intel having fallen behind AMD's performance in many areas, we seriously doubt that it wants to create a design that merely aims at being "faster than AMD in most areas." Whether or not they can succeed is of course a question for the future.
If we don our speculation hats for a minute, we'd say that Conroe will return to more typical pipeline lengths and also reduce the maximum clock speed of the processors based off it relative to NetBurst. A 20 pipeline stage design, give or take, seems to be reasonable - we heard a few people at WinHEC suggest that NetBurst was hubris in terms of pipeline lengths, and that 20 or fewer stages is where all foreseeable pipelines - Intel and otherwise - are heading. The concept of a trace cache also seems to have merit, so some variant of that concept could show up in Conroe - micro-ops fusion plus a trace cache larger than that of NetBurst sounds interesting to us at least, though we're not at all sure it's feasible. Along with the shorter, more efficient pipeline, Conroe could also look into going to a wider issue rate. Some people have argued (rather convincingly) that x86 code is not conducive to issuing more than 3 instructions per clock without expending significant die resources, however, and current designs rarely manage issuing three instructions per clock anyway. A better solution could be to simply add more execution units, branch prediction, prefetch logic, etc. to ensure that the core can actually reach the maximum issue rate more frequently. Taking something like Pentium M and adding more FP/SIMD computational power isn't too much of a stretch (though that seems to be where Yonah is already heading).
All of this is basically just educated guessing, and it remains to be seen whether any of these ideas make the final design of Conroe. The main point right now is that new architectures from Intel are not a frequent occurrence, so we expect it to be substantially different than P6/P6-M, NetBurst, and EPIC. Depending on how much collaboration there is between the various CPU design teams, we could see many elements of all three architectures or we could see a design largely derived from one or two of the others. If you consider that the Northwood to Prescott changes were pretty significant and Intel still didn't dub Prescott a new architecture, Conroe (and derivatives) ought to be a more significant redesign than going from 20 to 31 pipeline stages, adding EM64T, and changing cache sizes. Chances are that by the time we know more, we'll be under NDA as well until the official launch, so consider this our last chance at some enthusiast speculation.
33 Comments
IntelUser2000 - Monday, August 22, 2005 - link
First Itanium is 6-wide
Itanium 2 is 6-wide
Itanium 2 doesn't increase the issue rate, what Itanium 2 does is increase the possibility that IPC of 6 is possible by making better architecture.
Brief overview of Itanium architecture: The CPU processes the EPIC instructions by using two bundles of 3 instructions each, therefore achieving IPC of 6. Each bundle can have a certain combination of different instructions.
Main execution units in Itanium consists of 4 different kinds, that is Branch unit, Floating Point Unit, Memory Unit, Integer Unit. Memory and Integer unit can be considered in simple terms as ALU from what I understand.
In one bundle, you can have certain combinations of those execution units. Examples may be: MMI(memory, memory, integer), MII, MIF, BBB, and such. Remember that each bundle can have that combinations, and there is like 26 combinations or so. That means if the second bundle can't have that combinations due to the lack of execution units, 6-wide isn't possible.
Itanium had 2 M units, 2 I units, 2 FP units, 3 B units. So if the first bundle is MMI and the second bundle is MMI, it can't have 6-wide execution.
According to the article I read, first Itanium can have in theory of ~3.8 IPC due to lack of execution units, and Itanium 2 have theoretical IPC of 5.6-5.7 due to more execution units, specifically 4 M units rather than 2 as in Itanium.
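A rough sketch of the bundle arithmetic above, in Python. It only counts the slots two bundles need against the available units. The Itanium unit counts are the ones quoted in this comment; for Itanium 2 only the 4 M units are quoted, so the I, F, and B counts are assumed unchanged. Real hardware has more detailed dispersal rules that this ignores, and the function name is made up for illustration.

```python
from collections import Counter

# Unit counts per type: M(emory), I(nteger), F(loating point), B(ranch).
ITANIUM = {"M": 2, "I": 2, "F": 2, "B": 3}
# Assumption: only the M count changes for Itanium 2.
ITANIUM_2 = {"M": 4, "I": 2, "F": 2, "B": 3}

def can_issue_together(bundle_a, bundle_b, units):
    """True if both 3-slot bundles fit on the available units at once."""
    needed = Counter(bundle_a + bundle_b)  # e.g. "MMI" + "MMI" -> M:4, I:2
    return all(needed[kind] <= units.get(kind, 0) for kind in needed)

for name, units in (("Itanium", ITANIUM), ("Itanium 2", ITANIUM_2)):
    for pair in (("MMI", "MMI"), ("MIF", "MIF"), ("BBB", "MMI")):
        ok = can_issue_together(pair[0], pair[1], units)
        print(f"{name}: {pair[0]} + {pair[1]} -> "
              f"{'6-wide' if ok else 'cannot dual-issue'}")
```

In this simplified model, MMI + MMI needs four M units, so it fails with 2 M units and fits with 4, which matches the example above.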
There are two kind of ways to run 32-bit for Itanium. One way is the hardware emulator that's in all current Itanium chips. The 32-bit performance for first Itanium runs 32-bit x86 code as worse as 66MHz 486, or good as 200MHz Pentium MMX, when Itanium is running at 800MHz. Itanium 2 has better hardware 32-bit emulator plus better overall Itanium architecture, so 32-bit performance increases to around equal to 300MHz Pentium II(1GHz Itanium 2 has twice the performance or better compared to 800MHz Itanium in native code). That's pretty bad, makes running 32-bit practically useless, and according to the review, the compatibility was not so good either, as Quake 3 wouldn't install(not that running Quake 3 on Pentium 100MHz equivalent isn't sort of a push). Plus it takes additional die space and power consumption, which is not that much but a lot for a almost useless feature.
So Intel introduced a dynamic software translator for the Itanium called IA-32EL (Execution Layer). By translating x86 instructions to EPIC instructions and optimizing them at run-time, performance improved dramatically while taking out the need to have a hardware emulator. A 1.5GHz Itanium 2 with 6MB L3 cache is now equal to an equivalently clocked Xeon MP (with hardware it would have been equal to a 450MHz Pentium II) or better, which isn't that bad, and much better than the hardware one.
Montecito seems to not have the hardware emulator anymore.
JarredWalton - Tuesday, August 23, 2005 - link
Dang, I *swear* I read an article on HP.com or Intel.com stating Itanium 2 was 8-wide. I can't find it anymore, but there are many saying 6-wide. Weird. Anyway, I've read plenty about the rest of the Itanium architecture, and I don't know why you're suddenly going off about it. I'll correct the issue width statement, though.
Not like it matters now, as we all know Conroe is 4-wide now. (I really expected that to be the case, but was told to make it less certain and more speculative for the article.)
IntelUser2000 - Thursday, August 25, 2005 - link
http://www.intel.com/design/itanium2/datashts/2509...
The intro shows that it's a 6-wide, 8-stage-deep pipeline architecture. 8 does stand for something but I forgot what. I babbled on because it wasn't directed all at you, but I hoped somebody who didn't know and wants to know may look at it.
JarredWalton - Friday, August 26, 2005 - link
Argh! WTF is going on? Am I senile? I'm positive I read something about Itanium 2 (McKinley, etc.) being more than 8 pipeline stages. It stated something about the 8 stages of Merced being part of the reason Itanium 1 never reached higher clock speeds. Damn... people must just make stuff up about these architectures. :|
IntelUser2000 - Thursday, September 1, 2005 - link
Itanium "Merced" is 10 stage pipelines. Nearly everyone that looked at the architecture said it was a bloated design, that was released in haste. By improving design tremendously over Itanium, Itanium 2 Mckinley reduces that to 8 stage pipeline while clocking 25% higher at the SAME process.
Itanium-800MHz, 0.18 micron, 10 stage pipeline, 9 stage branch miss stages
Itanium 2-1GHz, 0.18 micron, 8 stage pipeline, 7 stage branch miss stages
nserra - Friday, August 19, 2005 - link
I agree IntelUser2000, but even so, if each core used c&q with some disable core capability, would be in the 30W per core range (120W total) right on track with prescott 2M and Pentium D.
I don’t know if you noticed, but amd added more power to their designs while their processor are consuming less.... that must be because:
Good reasons first:
-amd will achieve higher clock speeds 3.4 GHz and up
-amd is already thinking in 4 cores processors
Bad reasons:
-amd will come with some bad 65nm tech
-or will come with some bad core (M2 with rev.F prescott like)
dwalton - Monday, August 15, 2005 - link
"Intel Q3'05 Roadmap: Conroe Appears, Speculation Ensues"
I almost spit my coffee onto the keyboard when i read that title. Came off to me as Intel released a roadmap showing the Conroe release in the third quarter of this year.
JarredWalton - Monday, August 15, 2005 - link
Sorry to disappoint. :p
Intel's lead time on the roadmap is about 18 months, though the initial details are often lacking. With Conroe/Merom being a new architecture, I doubt Intel will do so much as mention a clock speed without NDAs.
IntelUser2000 - Friday, August 12, 2005 - link
Intel's 45nm is supposed to signal high-K, metal gates, and possibly tri-gate transistor structure. By using tri-gate, its supposed to be fully depleted substrate from the start. So, if they implement what they say they will according to their presentations:
-High-K
-Metal
-Tri-gate, which brings FD-SOI
We should see Yonah before worrying about Conroe. The specs of Yonah is pretty interesting.
JarredWalton - Saturday, August 13, 2005 - link
Yonah looks interesting in some ways, but as far as I can tell it's just Dothan on 65nm with dual cores, improved uops-fusion, and hopefully better FP/SIMD support. I haven't even heard anything to indicate it will have 64-bit extensions, which makes it less than Conroe in my book. Not that 64-bit is the be-all, end-all, but I'm pretty sure I've bought my last 32-bit CPU now. I'd hate to get stuck upgrading for Longhorn just because I didn't bother with a 64-bit enabled processor. Bleh... Longhorn and 64-bit is really just hype anyway, but we'll be forced that way like it or not. Hehehe.