Intel's Pentium 4 E: Prescott Arrives with Luggage
by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST - Posted in CPUs
Larger, Slower Cache
On the surface, Prescott is nothing more than a 90nm Pentium 4 with twice the cache size, but we’ve hopefully been able to illustrate quite the contrary thus far. Despite all of the finesse Intel has exhibited in improving branch predictors, scheduling algorithms and new execution blocks, the company also exploited one of the easiest known ways to keep a long pipeline full – increasing cache size.
With Prescott, Intel debuted its highest-density cache ever – each SRAM cell (the building block of cache) is now 43% smaller than the cells used in Northwood. What this means is that Intel can pack more cache into a smaller area than a straight die shrink alone would have allowed.
While Intel has conventionally increased L2 cache size, L1 cache has normally remained unchanged – armed with Intel’s highest density cache ever, Prescott gets a larger L1 cache as well as a larger L2.
The L1 data cache has been doubled to 16KB and is now 8-way set associative. Intel states that the access latency to the L1 data cache is approximately the same as that of Northwood’s 8KB 4-way set associative cache, but the hit rate (the probability of finding the data you’re looking for in cache) has gone up tremendously. The increase in hit rate is due not only to the larger cache size, but also to the increased associativity.
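As a rough illustration – this is a toy LRU model run on a purely synthetic address trace, not Intel's actual replacement policy or a real workload – a few lines of Python show how growing a cache from 8KB/4-way to 16KB/8-way lifts the hit rate:

```python
import random

def cache_hit_rate(size_bytes, ways, line_size, addresses):
    """Toy LRU set-associative cache: returns the hit rate over an address trace."""
    n_sets = size_bytes // (line_size * ways)
    sets = [[] for _ in range(n_sets)]  # each set holds tags in LRU order
    hits = 0
    for addr in addresses:
        line = addr // line_size
        idx, tag = line % n_sets, line // n_sets
        s = sets[idx]
        if tag in s:
            hits += 1
            s.remove(tag)    # move to most-recently-used position
            s.append(tag)
        else:
            if len(s) == ways:
                s.pop(0)     # evict the least-recently-used line
            s.append(tag)
    return hits / len(addresses)

# Synthetic trace: uniform random accesses over a 64KB working set
random.seed(0)
trace = [random.randrange(64 * 1024) for _ in range(50_000)]

northwood_l1 = cache_hit_rate(8 * 1024, 4, 64, trace)    # 8KB, 4-way
prescott_l1  = cache_hit_rate(16 * 1024, 8, 64, trace)   # 16KB, 8-way
print(f"8KB 4-way hit rate:  {northwood_l1:.3f}")
print(f"16KB 8-way hit rate: {prescott_l1:.3f}")
```

On a uniform random trace the gain is almost entirely capacity; on real code, where conflict misses matter, the jump from 4-way to 8-way associativity contributes as well.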
Intel would not reveal (even after much pestering) the L1 cache access latency, so we were forced to use two utilities – Cachemem and ScienceMark – to help determine whether there was any appreciable increase in access latency to data in the L1.
| | Cachemem L1 Latency | ScienceMark L1 Latency |
|---|---|---|
| Northwood | 1 cycle | 2 cycles |
| Prescott | 4 cycles | 4 cycles |
Although Cachemem and ScienceMark don't produce identical results, they both agree on one thing: Prescott's L1 cache latency has increased by more than an insignificant amount. We will just have to wait for Intel to reveal the actual L1 access latencies in order to confirm our findings here.
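For reference, tools like Cachemem and ScienceMark estimate latency with a pointer chase: each load's address depends on the result of the previous load, so elapsed loop time divided by iteration count approximates one dependent access. The sketch below shows the technique in Python – interpreter overhead swamps the few-nanosecond cache latency here, which is why real tools implement the inner loop in tight assembly:

```python
import random
import time

def make_chain(n_elems):
    """Build a random cyclic pointer chain (as indices) so each load depends
    on the last and the hardware prefetcher can't guess the next address."""
    random.seed(1)
    order = list(range(n_elems))
    random.shuffle(order)
    nxt = [0] * n_elems
    for i in range(n_elems - 1):
        nxt[order[i]] = order[i + 1]
    nxt[order[-1]] = order[0]    # close the cycle
    return nxt

def ns_per_hop(nxt, steps=200_000):
    """Walk the chain; elapsed time / steps approximates one dependent load."""
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = nxt[i]
    return (time.perf_counter() - t0) / steps * 1e9

small = make_chain(1_000)        # footprint small enough to sit in L1
large = make_chain(1_000_000)    # footprint that spills well past L2
print(f"small: {ns_per_hop(small):.1f} ns/hop, large: {ns_per_hop(large):.1f} ns/hop")
```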
Although the size of Prescott’s Trace Cache remains unchanged, it has been improved thanks to some additional die budget the designers had.
The role of the Trace Cache is similar to that of an L1 instruction cache: as instructions are sent down the pipeline, they are cached in the Trace Cache while the data they operate on is cached in the L1 data cache. A Trace Cache is superior to a conventional instruction cache in that it caches instructions from further down the pipeline, so if a mispredicted branch or another issue forces execution to start over, you don’t have to go back to Stage 1 of the pipeline – you can resume at, say, Stage 7.
The Trace Cache accomplishes this by not caching instructions as they are sent to the CPU, but the decoded micro operations (µops) that result after sending them through the P4’s decoders. The point of decoding instructions into µops is to reduce their complexity, once again an attempt to reduce the amount of work that has to be done at any given time to boost clock speeds (AMD does this too). By caching instructions after they’ve already been decoded, any pipeline restarts will pick up after the instructions have already made it through the decoding stages, which will save countless clock cycles in the long run. Although Prescott has an incredibly long pipeline, every stage you can shave off during execution, whether through Branch Prediction or use of the Trace Cache, helps.
The problem with a Trace Cache is that it is very expensive to implement; achieving a hit rate similar to that of an instruction cache requires significantly more die area. The original Pentium 4, and even today’s Prescott, can only cache approximately 12K µops (with a hit rate equivalent to an 8 – 16KB instruction cache). AMD has a significant advantage over Intel in this regard, as it has had a massive 64KB instruction cache ever since the original Athlon. Today’s P4-optimized compilers are aware of the very small Trace Cache and produce code that works around it as best they can, but it’s still a limitation.
Another limitation of the Trace Cache is that because space is limited, not all µops can be encoded within it. For example, complicated instructions that would take a significant amount of space to encode within the Trace Cache are instead left to be sequenced from slower ROM located on the chip. Encoding logic for more complicated instructions can occupy precious die space that is already limited because of the complexity of the Trace Cache itself. With Prescott, Intel has allowed a few more types of µops to be encoded within the Trace Cache, instead of forcing the processor to sequence them from microcode ROM (a much slower process).
If you recall from the branch predictor section of this review, we talked about Prescott’s indirect branch predictor – to go hand in hand with that improvement, µops that involve indirect calls can now be encoded in the Trace Cache. The Pentium 4 also has a software prefetch instruction that developers can use to instruct the processor to pull data into its cache before it is needed by normal execution. This prefetch instruction can now be encoded in the Trace Cache as well. Both of these Trace Cache enhancements are designed to reduce latencies as much as possible – once again, something that is necessary because of Prescott’s incredible pipeline length.
Finally, we have Prescott’s L2 cache: a full 1MB. Prescott’s L2 cache size has caught up with that of the Athlon 64 FX, which it needs – unlike the Athlon 64, Prescott has no on-die memory controller and thus depends on larger caches to hide memory latency as much as possible. Unfortunately, the larger cache comes at the sacrifice of access latency – it now takes longer to get to the data in Prescott’s cache than it did on Northwood.
| | Cachemem L2 Latency | ScienceMark L2 Latency |
|---|---|---|
| Northwood | 16 cycles | 16 cycles |
| Prescott | 23 cycles | 23 cycles |
Both Cachemem and ScienceMark agree that Prescott has a ~23-cycle L2 cache – a 44% increase in access latency over Northwood. The only way for Prescott's slower L2 cache to overcome this increase in latency is to run at higher clock speeds than Northwood.
If our cache latency figures are correct, it will take a 4GHz Prescott to have a faster L2 cache than a 2.8GHz Northwood. It will take a 5GHz Prescott to match the latency of a 3.4GHz Northwood. Hopefully by then the added L2 cache size will be more useful as programs get larger, so we'd estimate that the Prescott's cache would begin to show an advantage around 4GHz.
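The back-of-the-envelope math here is just cycles divided by clock speed. A quick sanity check of our own estimate, using the measured cycle counts above (the crossover against a 2.8GHz Northwood sits right around 4GHz):

```python
def latency_ns(cycles, ghz):
    """Effective access latency in nanoseconds = cycles / clock (GHz)."""
    return cycles / ghz

print(f"Northwood 2.8GHz: {latency_ns(16, 2.8):.2f} ns")  # ~5.71 ns
print(f"Northwood 3.4GHz: {latency_ns(16, 3.4):.2f} ns")  # ~4.71 ns
print(f"Prescott 4.0GHz:  {latency_ns(23, 4.0):.2f} ns")  # 5.75 ns - roughly the crossover
print(f"Prescott 5.0GHz:  {latency_ns(23, 5.0):.2f} ns")  # 4.60 ns - beats 3.4GHz Northwood
```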
Intel hasn’t changed any of the caching algorithms or the associativity of the L2 cache, so there are no tricks to reduce latency here – Prescott just has to pay the penalty.
For today’s applications, this increase in latency almost single-handedly eats away any performance benefit that would be seen from the doubling of Prescott’s cache size. In the long run, as applications and the data they work on get larger, the cache size will begin to overshadow the increase in latency, but for now, the L2 latency will do a good job of keeping Northwood faster than Prescott.
104 Comments
terrywongintra - Monday, February 2, 2004 - link
anybody benchmark prescott over northwood in entry-server environment? i'm installing 3 servers later by using intel 875p (s875wp1-e) entry server board n p4 2.8, need to decide prescott or northwood to use.
sipc660 - Monday, February 2, 2004 - link
i don't understand why some people are bashing such a good innovation that was long overdue from intel. a pc that doubles as a heater at only 100-200W power consumption.
Let me remind you that a conventional fan heater eats up a kilowatt of power.
Think positive
* space reduction
* enormous power savings (pc + fan heater)
* extremely sophisticated looking fan heater
* extremely safe casing. reduces burn injuries to pets and children.
* finely tunable temperature settings (only need to overclock by small increments)
* coupled with an lcd it features the best looking temperature adjustment one has ever witnessed on a heater
* child proof as it features thermal shutdown
* anyone having a laugh thus far
* will soon feature on american idol. the worst singers will receive one p4 E based unit each. That should make people think twice about auditioning thus making sure only true talent shows up.
* gives dell new marketing potential and a crack at a long desired consumer heating electronic
* amd is nowhere near this advancement in thermal technology leaving intel way ahead
hope you enjoyed some of my thoughts
Other than that good article and some good comments.
on another note i don't understand why people run and fill intels pockets so intel can hide their engineering mistakes with unseen propaganda, while there is an obvious choice.
choice is Advanced Micro Devices all until intel gets their act together.
go amd...
Stlr22 - Monday, February 2, 2004 - link
INTC - "Intel roadmap says Prescott will hit 4.2 GHz by Q1 '05. My guess is that it is already running at 4 GHz but just needs to be fine tuned to reduce the heat."
Maybe they are trying to keep it under the 200watt mark? ;-)
INTC - Monday, February 2, 2004 - link
I think CRAMITPAL must have sat on a hot Prescott and got it stuck where the sun doesn't shine - that would explain all of the yelling and screaming and friggin this and friggin that going on. "Approved mobo, approved PC case cooling system, approved heatsink & fan - and you better not use Arctic Silver or else it will void your warranty..." gee - didn't we just hear that when Athlon XPs came out? It brings to mind when TechTV put their dual Athlon MP rig together and it started smoking and catching on fire when they fired it up the first time on live television during their show.
Intel's roadmap says Prescott will hit 4.2 GHz by Q1 '05. My guess is that it is already running at 4 GHz but just needs to be fine tuned to reduce the heat. I bet the experts (or self-proclaimed experts such as CRAM) were betting that Northwood could not hit 3 GHz and look where it is today. Video card GPUs today are hitting 70 degrees C plus at full load but they do fine with cooling in the same PC cases.
CRAMITPAL - Monday, February 2, 2004 - link
Dealing with the FLAME THROWER's heat issues is only one aspect of Prescott's problems. The chip is a DOG and it requires an "approved mobo" and an "approved PC case cooling system", a premo PSU cause the friggin thing draws 100+ Watts, and this crap all costs money you don't need to spend on an A64 system that is faster, runs cooler, and does both 32/64 bit processing faster. How difficult is THIS to comprehend???
Ain't no way Intel is gonna be able to spin this one despite the obvious "press material" they supplied to all the reviewers to PIMP that Prescott was designed to reach 5 Gigs. Pigs will fly lightyears before Prescott runs at 5 Gigs.
Time to GET REAL folks. Prescott sucks and every hardware review site politely stated so in "political speak".
Stlr22 - Monday, February 2, 2004 - link
((((((((((((CRAMITPAL)))))))))))))))
It's ok man. It's ok. Everything will be alright.
;-)
scosta - Monday, February 2, 2004 - link
#38 - About your "Did anyone catch the error in Pipelining: 101?". There is no error. The time it takes to travel the pipeline is just a kind of process delay. What matters is the rate at which finished/processed results come out of the pipeline. In the case of the 0.5ns/10-stage pipeline you will get one finished result every 0.5ns, twice as many as in the case of the 1ns/5-stage pipeline.
If the pipelines were building motorcycles, you would get, respectively, 1 and 2 motorcycles every ns. And that is the point.
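The throughput argument in this comment checks out with a couple of lines of arithmetic: in an ideal pipeline, total time is the fill latency plus one result per stage time thereafter. A quick sketch (hypothetical ideal pipelines, no stalls or mispredicts):

```python
def pipeline_time_ns(n_results, stages, stage_ns):
    """Ideal pipeline: fill latency (stages * stage_ns), then one result
    every stage_ns thereafter."""
    return stages * stage_ns + (n_results - 1) * stage_ns

short_pipe = pipeline_time_ns(1000, 5, 1.0)    # 5 stages at 1 ns each
long_pipe  = pipeline_time_ns(1000, 10, 0.5)   # 10 stages at 0.5 ns each
print(f"5-stage/1ns pipeline:    {short_pipe:.1f} ns")   # 1004.0 ns
print(f"10-stage/0.5ns pipeline: {long_pipe:.1f} ns")    # 504.5 ns
```

Both pipelines take 5 ns to produce their first result, but the deeper, faster-clocked one finishes 1000 results in roughly half the time – exactly the motorcycle point.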
LordSnailz - Monday, February 2, 2004 - link
I'm sure the Prescotts will get hotter as the speed increases, but you can't forget there are companies out there that specialize in this area. There are 3 companies that I know of that are doing research on ways to reduce the heat; for instance, they're planning on placing a piece of silicon with etched lines on top of the CPU and running some type of coolant through it. Much like the radiator concept.
My point is, Intel doesn't have to worry about the heat too much since there are companies out there fighting that battle. Intel will just concentrate on achieving those higher speeds and the temp control solution will come.
scosta - Monday, February 2, 2004 - link
You can find thermal power information in the also excellent "Aces Hardware" Prescott review here: [L=myurl]http://www.aceshardware.com/read.jsp?id=60000317[/l]
In summary, we have the following typical thermal power:
P4 3.2 GHz (Northwood) - 82W
P4E 3.2 GHz (Prescott) - 103W
Note that, at the same clock speed and with the same or lesser performance, Prescott dissipates 25% more power than Northwood. This means that with a similar cooling system, Prescott has to run substantially hotter.
As AcesHardware says,
[Q]After running a 3DSMax rendering and restarting the PC, the BIOS reported that the 3.2 GHz Northwood was at about 45-47°C, while Prescott was flirting with 64-66°C. Mind you, this is measured on a motherboard completely exposed to the cool air (18°C) of our lab.[/Q]
So, what will a ~5GHz Prescott dissipate? 200W?
Will we all be forced to run PCs with bulky, expensive cryogenic cooling systems? I for one won't. This power consumption escalation has to stop. Intel and AMD have to improve the performance of their CPUs by improving CPU architecture and manufacturing processes, not by throwing more and more electrical power at the problem.
And those are my 2 cents.
CRAMITPAL - Monday, February 2, 2004 - link
Prescott will never go above 3.8 Gig even with the 3rd revision of the 90 nano process. Tejas will make it to just over 4.0 Gig with a little luck, but it won't be anything to write home about either, based on current knowledge.
Intel has fallen and can't get it up!