Intel Q3'05 Roadmap: Conroe Appears, Speculation Ensues
by Kristopher Kubicki & Jarred Walton on August 8, 2005 3:13 AM EST - Posted in CPUs
In order to have any inkling of what Conroe will offer, we need to take a step back for a minute. The last truly new architecture that Intel introduced was the IA-64/EPIC platform for Itanium (although, depending on how you look at it, some would say that NetBurst actually came after IA-64). Prior to that, Intel had the P6 architecture, which was preceded by P5 (Pentium), 486, 386, and so on, all the way back to the first parts Intel made. At present, three major architectures are in production at Intel: P6 (Pentium Pro/II/III, now evolved into Pentium M), NetBurst (Pentium 4 and derivatives), and the IA-64/EPIC design used in Itanium processors. P6 isn't actually the real name of the Pentium M architecture, of course - Intel has never come forth with an official name. While Pentium M does use an extension of P6, the Banias and Dothan cores really change things quite a bit. We'll talk about how in a moment, but we'll refer to the architecture as P6-M for the remainder of this article; when we say P6-M, we mean Banias, Dothan, and Yonah. Let's take a quick look at the benefits and problems of each architecture before we talk about Conroe.
Prescott is used in the recent Pentium 4 and Celeron D chips and has a 31-stage pipeline coupled to a separate 8-stage fetch/decode front end. (The earlier Northwood and Willamette cores use a 20-stage pipeline with the same 8-stage front end.) Together, the total pipeline length comes in at 39 stages - more than twice the length of the current AMD K8 pipeline. In fact, the next-longest pipelines outside of Intel aren't even out yet: Cell and Xenon are both around 21 stages long. The benefit of a long pipeline is raw clock speed. It's no surprise that NetBurst is the only architecture currently shipping at speeds greater than 3 GHz, with Cell and Xenon slated to join that "elite" group of processors in the future.
While a lengthy pipeline allows for high clock speeds, it also introduces inefficiencies whenever a branch prediction misses. When that occurs, everything following the branch instruction has to be cleared from the CPU pipeline and execution begins again - a penalty of as much as 30 cycles in the case of Prescott. (It could be even longer if there's a cache miss and main memory needs to be accessed, but that delay would occur with or without the branch miss, so we'll ignore it.) In order to avoid the full penalty of a branch misprediction (39 cycles), Intel decoupled the fetch/decode unit from the main pipeline and turned the L1 instruction cache into a "trace cache" where instructions are stored in already-decoded form. The trace cache is a very interesting concept and certainly helped improve performance: it basically allows many instructions to skip the first 8 stages - roughly a fifth of the Prescott pipeline, or better than a quarter of Northwood's. Intel no longer holds the performance crown, but it wasn't until the launch of the K8 that the lead was really lost.
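To put those penalties in perspective, here's a minimal back-of-envelope sketch of effective throughput. The branch frequency, predictor accuracy, and per-miss penalties below are illustrative assumptions on our part, not measured figures:

```python
# A rough model of how misprediction penalty erodes effective throughput.
# All constants here are illustrative assumptions, not published numbers.

def effective_ipc(base_ipc, branch_freq, miss_rate, penalty_cycles):
    """Average instructions per cycle once branch-miss stalls are included."""
    base_cpi = 1.0 / base_ipc                          # cycles/instr, no stalls
    stall_cpi = branch_freq * miss_rate * penalty_cycles  # extra cycles/instr
    return 1.0 / (base_cpi + stall_cpi)

# Assume ~1 branch per 5 instructions and a 95%-accurate predictor.
for name, penalty in [("Prescott (~30 cycles)", 30), ("K8 (~12 cycles)", 12)]:
    ipc = effective_ipc(base_ipc=3.0, branch_freq=0.2,
                        miss_rate=0.05, penalty_cycles=penalty)
    print(f"{name}: effective IPC ~ {ipc:.2f}")
```

Even with a predictor that's right 95% of the time, the toy numbers show the longer pipeline surrendering a much larger slice of its theoretical issue rate.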
In terms of the internal functioning of the NetBurst pipeline, at most three traces (instructions decoded into micro-ops) can be issued from the trace cache to the queues within the main pipeline each clock cycle. The NetBurst queues (schedulers) can then dispatch up to six micro-ops per cycle, but there are restrictions, and in many cases there are execution slots that can't be filled on any given cycle. Based on the number of traces issued per clock, most would call NetBurst a three-wide issue design - the same as AMD's K7/K8 cores as well as the P6/P6-M cores. Purely from a theoretical standpoint, then, NetBurst could execute three instructions per clock, multiplied by the clock speed to give the final performance. Nothing ever reaches its theoretical performance, of course - if it did, NetBurst would still be over 35% faster than any other architecture, given its high clock speeds. Branch misses, cache misses, instruction dependencies, etc. all serve to reduce the theoretical performance.
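If you want the math behind that claim, here's a trivial sketch, assuming the fastest shipping clocks as of this writing (a 3.8 GHz Pentium 4 against a 2.8 GHz Athlon 64 FX, both 3-wide designs):

```python
# Theoretical peak throughput = issue width x clock speed. Clocks below are
# the fastest shipping parts at the time of writing (an assumption that
# will date quickly).

def peak_ginstr(issue_width, clock_ghz):
    return issue_width * clock_ghz  # billions of instructions per second

netburst = peak_ginstr(3, 3.8)  # 11.4 Ginstr/s
k8       = peak_ginstr(3, 2.8)  #  8.4 Ginstr/s
print(f"NetBurst peak: {netburst:.1f}, K8 peak: {k8:.1f} Ginstr/s")
print(f"Paper advantage: {100 * (netburst / k8 - 1):.0f}%")  # ~36%
```

Three-wide at 3.8 GHz against three-wide at 2.8 GHz works out to roughly a 36% advantage - on paper only.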
Moving on to the Pentium M core, you can find some of the details of what was changed in our Dothan investigation from last year. The basic idea is to take the P6 core and add some of the latest technologies to the design. To recap the earlier article, the Pentium M has several major design features. First, it goes with a more moderate pipeline length: longer than P6 to allow higher clock speeds, but shorter than NetBurst. (Intel isn't saying more than that, though guesstimates would put the length at around 14 to 17 stages.) Next, Intel added micro-ops fusion to the core, which lets certain pairs of micro-ops travel through the out-of-order core as a single unit; fewer micro-ops to track means less dependency-checking overhead and faster, more power-efficient progress through the core. The core also has a dedicated stack manager that helps improve memory access efficiency as well as lower power use. Better branch prediction is another major improvement relative to the P6 design - take something like the branch prediction of NetBurst, put it on the P6 core, and that's a rough description of what was done. Branch prediction is one of those features that generally makes all code run a bit faster, once again reducing inefficiencies. The number of execution units remains the same as in P6, which means less wasted power on idle parts of the chip, while the faster system bus borrowed from NetBurst helps keep the processor fed with data. Finally, power-saving features were added to the cache, allowing the CPU to fully power up only small areas of the L2 cache for each access. The end result is a processor that has certain limitations but achieves a very high performance-per-Watt rating, which is important for a mobile part. As we've shown in several articles, Pentium M makes for an attractive laptop processor but still can't compete with desktop parts in certain tasks.
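Since branch prediction keeps coming up with every one of these architectures, a tiny illustration may help. The sketch below is the generic textbook two-bit saturating counter, not Intel's far more elaborate Pentium M predictor; it just shows why even simple history-based prediction gets loop-heavy code mostly right:

```python
# A classic two-bit saturating counter only flips its prediction after two
# consecutive misses, so a loop branch that's almost always taken gets
# mispredicted just once per loop exit. Generic textbook design, not Intel's.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # 0-1 predict not-taken, 2-3 predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken 9 times, then not taken once, repeated 10 times.
predictor, misses = TwoBitPredictor(), 0
history = ([True] * 9 + [False]) * 10
for taken in history:
    if predictor.predict() != taken:
        misses += 1
    predictor.update(taken)
print(f"Miss rate: {misses}/{len(history)} = {misses / len(history):.0%}")  # 10%
```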
Moving on to the final architecture, we come to IA-64/EPIC. While similar in some ways to the VLIW (Very Long Instruction Word) architectures of the past, Intel worked to overcome some of the problems with VLIW (specifically the need to recompile code for every processor update) and called the new approach EPIC: Explicitly Parallel Instruction Computing. In contrast to the P6, NetBurst, K7, and K8 architectures, which can issue up to three instructions per cycle, the current Itanium 2 chips can issue six instructions per clock. From a purely theoretical standpoint, the fastest Itanium 2 running at 1.6 GHz actually has more computational power than any other Intel chip. Throw in dual-core designs with HyperThreading - HyperThreading that actually works much better than NetBurst HTT due to the wide design of EPIC - and each chip not only has the potential to issue six instructions per clock, but should actually come relatively close to that number. Another difference between Itanium and the other designs is that large amounts of cache are present in order to keep the pipelines fed with data. Current models ship with up to 9MB of L3 cache, while future parts like Montecito will have 24MB of L3 cache (and a count of 1.7 billion transistors - more than seven times the transistor count of the Pentium D Smithfield core)!
Of course, with the wide issue rate of Itanium 2 (the original Itanium had a 6-wide core as well, but could generally only manage 3.5 to 4.0 IPC at best), you need a lot of execution units. NetBurst has 7 execution units in Prescott: two simple integer units (which can function as 4 integer units if you count the double-pumped design), a complex integer unit, two FP/SIMD units, and dedicated memory load and store units. If you count the simple integer units as two each, you could stretch and say NetBurst has nine execution units. AMD's K7 and K8 both have nine execution units as well, only they take a less specialized approach and instead have three each of the integer, FP/SIMD, and memory units. Each of AMD's integer units is fully functional, unlike the "simple" and "complex" integer units in NetBurst. In contrast to these architectures, the current Itanium 2 chips have six ALUs (Arithmetic Logic Units), three BRUs (Branch Units), two FPUs, one SIMD unit, and two load plus two store units - call it 16 functional units if you prefer, though the specialization of some of them makes it slightly less than that; the tally is sketched below. While Itanium 2 is very wide, its pipeline is only 8 stages long - significantly shorter than that of any modern x86 processor. That certainly plays a role in the reduced clock speeds, but as with the Athlon 64, lower clock speeds on a more efficient architecture can outperform long pipelines in many instances. In order to extract all of the potential performance from Itanium, however, a lot of work needs to be done during code compilation. This is the Achilles' heel of VLIW designs: processor updates require the code to be recompiled. While EPIC doesn't require that you recompile the code, newer compiler optimizations can improve performance significantly.
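For reference, here is the execution-unit tally from the last few paragraphs in one place. The groupings follow this article's descriptions rather than official Intel or AMD block diagrams, which slice some of these units differently:

```python
# Execution-unit counts as described in the text above (not official diagrams).
units = {
    "Prescott (NetBurst)": {"simple int ALU": 2, "complex int ALU": 1,
                            "FP/SIMD": 2, "load": 1, "store": 1},      # 7 total
    "AMD K7/K8":           {"integer": 3, "FP/SIMD": 3, "memory": 3},  # 9 total
    "Itanium 2":           {"ALU": 6, "branch": 3, "FP": 2, "SIMD": 1,
                            "load": 2, "store": 2},                    # 16 total
}
for cpu, breakdown in units.items():
    print(f"{cpu}: {sum(breakdown.values())} units")
```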
All that talk about other Intel architectures (as well as some of AMD's), and yet we still haven't said exactly what Conroe is. The simple truth is that no one other than Intel and people under strict NDA really knows for sure what the Conroe architecture will entail. There is a point to all of this discussion of previous architectures, though. While we've really only skimmed the surface of the designs, hopefully you can see how wildly different each architecture is from the others. NetBurst is long and narrow, EPIC is short and wide, and P6-M is a medium-length pipeline that is narrower than either of the others but requires less power. The high clock speeds and resultant power levels have created problems for NetBurst, but there are still cases where it substantially outperforms P6-M. Itanium remains a better solution for certain types of big-business work (databases in particular) than any of the other Intel architectures. While all three architectures have their strong points, none of them qualifies as a universally superior solution. With Intel having fallen behind AMD in performance in many areas, we seriously doubt that the company wants a design that merely aims at being "faster than AMD in most areas." Whether or not they can succeed is, of course, a question for the future.
If we don our speculation hats for a minute, we'd say that Conroe will return to a more typical pipeline length and also reduce the maximum clock speed relative to NetBurst for processors based on it. A 20-stage design, give or take, seems reasonable - we heard a few people at WinHEC suggest that NetBurst was hubris in terms of pipeline length, and that 20 or fewer stages is where all foreseeable pipelines - Intel and otherwise - are heading. The concept of a trace cache also seems to have merit, so some variant of that concept could show up in Conroe - micro-ops fusion plus a trace cache larger than NetBurst's sounds interesting to us at least, though we're not at all sure it's feasible. Along with the shorter, more efficient pipeline, Conroe could also look at a wider issue rate. Some people have argued (rather convincingly) that x86 code is not conducive to issuing more than 3 instructions per clock without expending significant die resources, however, and current designs rarely manage to issue three instructions per clock anyway. A better solution could be to simply add more execution units, branch prediction, prefetch logic, etc. to ensure that the core can actually reach the maximum issue rate more frequently. Taking something like Pentium M and adding more FP/SIMD computational power isn't too much of a stretch (though that seems to be where Yonah is already heading).
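Our 20-stage guess follows the usual depth-versus-IPC tradeoff. Here's a toy model of it; every constant is a made-up illustration (real studies of this tradeoff are far more involved), but with these numbers the sweet spot lands in the low 20s:

```python
# Toy pipeline-depth tradeoff: clock speed rises as logic is cut into more
# stages (minus per-stage latch overhead), while the branch-miss penalty
# grows with depth and drags effective IPC down. All constants are made up.

def relative_perf(depth, logic_delay=20.0, latch_overhead=0.5,
                  issue_width=3.0, miss_cost_per_stage=0.025):
    freq = 1.0 / (logic_delay / depth + latch_overhead)    # clock ~ 1/stage delay
    cpi = 1.0 / issue_width + miss_cost_per_stage * depth  # stalls grow with depth
    return freq / cpi                                      # perf ~ frequency x IPC

best = max(range(8, 45), key=relative_perf)
print(f"Toy model peaks at ~{best} stages")
for d in (8, 14, 20, 31, 39):
    print(f"{d:2d} stages: relative performance {relative_perf(d):.2f}")
```

The shape, not the exact peak, is the point: the frequency gained from a deeper pipeline eventually loses to the growing misprediction cost.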
Whether any of these ideas make it into the final design of Conroe is basically just an educated guess at this point. The main point right now is that new architectures from Intel are not a frequent occurrence, so we expect Conroe to be substantially different from P6/P6-M, NetBurst, and EPIC. Depending on how much collaboration there is between the various CPU design teams, we could see elements of all three architectures, or we could see a design largely derived from one or two of them. Consider that the Northwood to Prescott changes were pretty significant and Intel still didn't dub Prescott a new architecture; Conroe (and derivatives) ought to be a more significant redesign than going from 20 to 31 pipeline stages, adding EM64T, and changing cache sizes. Chances are that by the time we know more, we'll be under NDA ourselves until the official launch, so consider this our last chance at some enthusiast speculation.
33 Comments
IntelUser2000 - Thursday, August 18, 2005
And can you tell me how that's not significant? Yonah isn't like Smithfield's slap-on dual core, because it has arbitration logic to manage data between the two cores. And even compared to the A64 dual core, it's not just dual core plus something SRQ-like; it has a bunch of other enhancements that strengthen the weak points (FPU/SSE).
To: nserra
HT takes less than 5% of the die; of course the IMC is good, but the Pentium 4 could have an IMC too. I think HT and the IMC are each good in their own ways.
Cache consumes little power and takes up little die space relative to the number of transistors used. If you put 4 of today's 90nm Athlon 64 cores on one die, Prescott would look cool-running by comparison.
The 6MB cache in Itanium 2 takes up 60% of the die area but only 30% of the power consumption.
nserra - Friday, August 19, 2005
I agree, IntelUser2000, but even so, if each core used Cool'n'Quiet with some ability to disable cores, it would be in the 30W-per-core range (120W total), right on track with Prescott 2M and Pentium D. I don't know if you noticed, but AMD added more power headroom to their designs while their processors are consuming less... that must be because:
Good reasons first:
- AMD will achieve higher clock speeds, 3.4 GHz and up
- AMD is already thinking about 4-core processors
Bad reasons:
- AMD will come out with some bad 65nm tech
- or will come out with some bad core (an M2 with a Prescott-like rev. F)
coldpower27 - Saturday, August 13, 2005
Yeah, from current rumors Yonah has every checkbox feature besides EM64T :) Like I have said, I can't wait till Intel brings out Conroe technology, as I have always liked going the Intel route, but I don't want to go for NetBurst-based processors.
The 45nm generation looks to be quite the change for Intel, as they are moving to tri-gate transistors, high-k, and FD-SOI, though I believe it will be introduced at the end of 2007 at the earliest, rather than mid-year. Conroe is expected to debut on 65nm technology; hopefully it doesn't need an optical shrink to get good like NetBurst did, and is good from the get-go, like the Athlon 64 was.
IntelUser2000 - Friday, August 12, 2005
I heard that due to the limits of trace cache throughput, it can only achieve an IPC of 2, not 3, so even in theory the Pentium 4 only reaches an IPC of 2. About the HyperThreading technology, I sort of disagree. If the design of the microprocessor is made to accommodate such multi-threading technology, you don't need a 24% increase in die size like POWER5 had. I heard that with only a 5% increase in die size, the Alpha EV8 was supposed to get a 2x performance increase, which happens to be greater than what you get by putting on another core!
Pentium 4's HT takes LESS than 5% of the die.
nserra - Wednesday, August 17, 2005
Well, if you think that the 5% of die spent on HT is well spent, what about the 5% of the AMD Athlon 64 spent on the integrated memory controller?

JarredWalton - Saturday, August 13, 2005
In practice, I'd guess that NetBurst averages an IPC of around 1.3 overall. I'd say the Athlon 64 is closer to 2.0. Obviously just a guess, but when you consider how a 2.4 GHz A64 3800+ compares to the 3.6 GHz P4 (560), that seems about right. Heck, P4 might even be 1.1 to 1.2 IPC on average if K8 is 2.0. Branch misses kill IPC throughput on NetBurst, for example. (A quick sanity check on this follows at the end of the comment.) We also don't know precisely (well, I don't) what the various traces represent. It could be that many traces actually take up two of the "issue slots", as traces don't have to be a single micro-op.
HyperThreading in NetBurst is really pretty simplistic. It also doesn't really help much except in very specific circumstances. I can't imagine any SMT configuration actually providing a bigger boost than SMP, though. (Otherwise, everyone would already be doing it, rather than just NetBurst and high-end server chips.) I seriously doubt that a 5% die space increase would get more than a 10% performance increase; 10% I could see yielding 20 to 30%, and 15% could be 50% or more - of course, all just guesses and all under specific tests.
If you're not running multiple CPU-intensive threads, any form of SMT helps as much as SMP, which is to say not at all. Basically, this is all just guessing right now anyway, so there's no point in worrying about it too much. I have to think that Intel can get MUCH better performance with the next architecture than anything they've currently got, though. 2MB+ cache on CPUs is a lot of wasted space that could be better utilized, IMO.
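Here's the promised sanity check on those IPC guesses; the clock speeds are real, but the performance-parity assumption is the rough one from the comment, not a measurement:

```python
# If a 2.4 GHz Athlon 64 roughly matches a 3.6 GHz Pentium 4 (assumed parity),
# the clock ratio gives the IPC ratio directly.

a64_clock, p4_clock = 2.4, 3.6
ipc_ratio = p4_clock / a64_clock  # K8 IPC / NetBurst IPC = 1.5
k8_ipc = 2.0                      # guessed K8 average, per the comment above
p4_ipc = k8_ipc / ipc_ratio
print(f"If K8 averages {k8_ipc} IPC, NetBurst averages ~{p4_ipc:.2f} IPC")  # ~1.33
```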
nserra - Wednesday, August 17, 2005
Yeah, I completely agree!
I was hoping AMD would release a 4-core processor with 128KB of L2 cache for each core. That would give almost the same transistor count as 2 cores with 1MB of L2 each, but "a lot" more speed. (A rough check of that arithmetic follows below.)
Of course, in MARKETING terms, a processor with a total of 512KB of L2 cache would be a budget part, but to me it's an excellent, efficient design.
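A rough check of the transistor arithmetic above, using the standard 6-transistors-per-SRAM-bit rule of thumb; the per-core logic figure is our own loose assumption, not an AMD number:

```python
# 6T SRAM rule of thumb (6 transistors per bit), ignoring tag/ECC overhead.
def cache_transistors(kib):
    return kib * 1024 * 8 * 6  # KiB -> bits -> 6T cells

CORE_LOGIC = 50_000_000  # assumed K8 core sans L2, give or take

dual_1mb  = 2 * (CORE_LOGIC + cache_transistors(1024))
quad_128k = 4 * (CORE_LOGIC + cache_transistors(128))
print(f"2 cores x 1MB L2:   ~{dual_1mb / 1e6:.0f}M transistors")   # ~201M
print(f"4 cores x 128KB L2: ~{quad_128k / 1e6:.0f}M transistors")  # ~225M
```

Within roughly 12% under these assumptions, so "almost the same" holds, give or take the core-logic estimate.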
IntelUser2000 - Tuesday, August 16, 2005
Well, a point to make is this: because the designers of the Alpha CPUs managed to beat every other CPU at every generation and every process generation with a simpler core, it's likely that the future generation would have done so too. It's not that companies avoid SMT because they don't know its benefits; it's that they don't know how to make it good. Did you think it made sense for Intel to make the Prescott core? IBM looks to be the best at SMT, and they are one of only two companies that actually use SMT nowadays, the other being Intel in desktop chips. Plus, server chip designs are usually pushed to their technical limits, while desktop chips are made mainly for mass production and profit.
(The exception is the multi-threading in the Itanium code-named Montecito, since it uses a different form of it.)
IntelUser2000 - Tuesday, August 16, 2005
About the IPC: in theory the P4 can sustain an IPC of 2 and the Athlon 64 an IPC of 3, so even with the same branch misses the Pentium 4 would in theory be slower than the Athlon 64, not to mention that in practice it adds more branch misses. About SMT, look here: http://www.realworldtech.com/page.cfm?ArticleID=RW...
"The enormous potential of SMT is shown by the expectation that it can approximately double the instruction throughput of an already impressive monster like the EV8 at the cost of only about 6% extra die area over a single threaded version of the design. That is a bigger speedup than can be typically achieved by duplicating the entire MPU as done in a 2 way SMP system!"
Though it looks as if the P4's multi-threading is a simple implementation not designed to take advantage of the architecture, it's more the other way around:
The Pentium 4, with its limited IPC throughput (2 max), limited number of registers (8, or 16 in 64-bit mode), and limited bandwidth, is crippling HT's ability.
The Alpha EV8 was supposed to have a theoretical IPC of 8, 1024 registers (!), and an integrated memory controller with 20GB/sec of bandwidth per CPU, and its architecture was developed from the beginning to take advantage of SMT, so it shows SMT's full benefits. (The throughput-per-area arithmetic from the quote above is sketched after this comment.)
The next-gen Itanium with multi-threading is a different story. Montecito doesn't use SMT; it uses a different form of multi-threading, so it's not really comparable.
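The arithmetic behind the RWT quote above, comparing throughput gained per unit of die area; the numbers are the ones cited in the quote plus an idealized SMP case, not measurements:

```python
# EV8 estimate from the quote: ~2x throughput for ~6% extra die area.
# Idealized SMP: 2x throughput for 2x the silicon.

smt_speedup, smt_area = 2.0, 1.06
smp_speedup, smp_area = 2.0, 2.00

print(f"SMT: {smt_speedup / smt_area:.2f}x throughput per unit area")  # ~1.89
print(f"SMP: {smp_speedup / smp_area:.2f}x throughput per unit area")  # 1.00
```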
Horshu - Friday, August 12, 2005
Does Conroe's roadmap intersect early on with the 45nm process (2007)? That was the point at which Intel was supposed to migrate over to the new high-k/metal-gate transistors, although I recall something about those plans being dropped while Intel works on a new high-k process. The new gates were supposed to dramatically reduce heat dissipation, although I have no idea what to expect from the new high-k they are working on.