Intel's Pentium 4 E: Prescott Arrives with Luggage
by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST- Posted in
- CPUs
Typical branches have one of two options: either don’t take the branch, or go to the target instruction and begin executing code there:
A Typical Branch ... |
---|
There is a third type of branch – called an indirect branch – that complicates predictions a bit more. Instead of telling the CPU where to go if the branch is taken, an indirect branch will tell the CPU to look at an address in a register/main memory that will contain the location of the instruction that the CPU should branch to. An indirect branch predictor, originally introduced in the Pentium M (Banias), has been included in Prescott to predict these types of branches.
An Indirect Branch ... |
---|
Conventionally, you predict an indirect branch somewhat haphazardly by telling the CPU to go to where most instructions of the program end up being located. It’s sort of like needing to ask your boss what he wants you to do, but instead of asking just walking into the computer lab because that’s where most of your work ends up being anyways. This method of indirect branch prediction ends up working well for a lot of cases, but not all. Prescott’s indirect branch predictor features algorithms to handle these cases, although the exact details of the algorithms are not publicly available. The fact that the Prescott team borrowed this idea from the Pentium M team is a further testament to the impressive amount of work that went into the Pentium M, and what continues to make it one of Intel’s best designed chips of all time.
Prescott’s indirect branch predictor is almost directly responsible for the 55% decrease in mispredicted branches in the 253.perlbmk SPEC CPU2000 test. Here’s what the test does:
253.perlbmk is a cut-down version of Perl v5.005_03, the popular scripting language. SPEC's version of Perl has had most of OS-specific features removed. In addition to the core Perl interpreter, several third-party modules are used: MD5 v1.7, MHonArc v2.3.3, IO-stringy v1.205, MailTools v1.11, TimeDate v1.08
The reference workload for 253.perlbmk consists of four scripts:
The primary component of the workload is the freeware email-to-HTML converter MHonArc. Email messages are generated from a set of random components and converted to HTML. In addition to MHonArc, which was lightly patched to avoid file I/O, this component also uses several standard modules from the CPAN (Comprehensive Perl Archive Network).
Another script (which also uses the mail generator for convienience) excercises a slightly-modified version of the 'specdiff' script, which is a part of the CPU2000 tool suite.
The third script finds perfect numbers using the standard iterative algorithm. Both native integers and the Math::BigInt module are used.
Finally, the fourth script tests only that the psuedo-random numbers are coming out in the expected order, and does not really contribute very much to the overall runtime.The training workload is similar, but not identical, to the reference workload. The test workload consists of the non-system-specific parts of the acutal Perl 5.005_03 test harness.
In the case of the mail-based benchmarks, a line with salient characteristics (number of header lines, number of body lines, etc) is output for each message generated.
During processing, MD5 hashes of the contents of output "files" (in memory) are computed and output.
For the perfect number finder, the operating mode (BigInt or native) is output, along with intermediate progress and, of course, the perfect numbers.
Output for the random number check is simply every 1000th random number generated.
As you can see, the performance improvement is in a real-world algorithm. As is commonplace for microprocessor designers to do, Intel measured the effectiveness of Prescott’s branch prediction enhancements in SPEC and came up with an overall reduction in mispredicted branches of about 13%:
164.gzip | 1.94% |
175.vpr | 8.33% |
176.gcc | 17.65% |
181.mcf | 9.63% |
186.crafty | 4.17% |
197.parser | 17.92% |
252.eon | 11.36% |
253.perlbmk | 54.84% |
254.gap | 27.27% |
255.vortex | -12.50% |
256.bzip2 | 5.88% |
300.twolf | 6.82% |
Overall | 12.78% |
The improvements seen above aren’t bad at all, however remember that this sort of a reduction is necessary in order to make up for the fact that we’re now dealing with a 55% longer pipeline with Prescott.
The areas that received the largest improvement (> 10% fewer mispredicted branches) were in 176.gcc, 197.parser, 252.eon, 253.perlbmk and 254.gap. The 176.gcc test is a compiler test, which the Pentium 4 has clearly lagged behind the Athlon 64 in. 197.parser is a word processing test, also an area where the Pentium 4 has done poorly in the past thanks to branch-happy integer code. 252.eon is a ray tracer, and we already know about 253.perlbmk; improvements in 254.gap could have positive ramifications for Prescott’s performance in HPC applications as it simulates performance in math intensive distributed data computation.
The benefit of improvements under the hood like the branch prediction algorithms we’ve discussed here is that they are taken advantage of on present-day software, with no recompiling and no patches. Keep this in mind when we investigate performance later on.
We’ll close this section off with another interesting fact – although Prescott features a lot of new improvements, there are other improvements included in Prescott that were only introduced in later revisions of the Northwood core. Not all Northwood cores are created equal, but all of the enhancements present in the first Hyper Threading enabled Northwoods are also featured in Prescott.
104 Comments
View All Comments
mattsaccount - Sunday, February 1, 2004 - link
From the HardOCP review: "Certainly moving to watercooling helped us out a great deal. In fact it is hard for us to recommend buying a Prescott and cooling it any other way."eBauer - Sunday, February 1, 2004 - link
I am curious as to why the UT2k3 botmatch scores dropped on all CPU's... Different map?Pumpkinierre - Sunday, February 1, 2004 - link
Sorry errata on #20 that was 3.0 Northood result is out of kilter with other cpus in dtata analysis sysmark 2004.Pumpkinierre - Sunday, February 1, 2004 - link
JFK,Vietnam,Nixon,Monica,Bush/Gore,Iraq and now this! - what is going on with the leader of the free world.I hope it overclocks well- that's all that's going for it. Maybe Intel should rethink their multiplier locked policy. AMD must get in there and profit. I still dont understand why the caches are running at half the latency as Northood if they are the same speed and structure? Is it as a result of a doubling in size for the same associativity?Good article- needs re-rereading after digestion. Last chart in Sysmark2004 (data analysis) has 3.0 Prescott totally outperformed by 2.8 Prescott and all other cpus. Look like a benchmark/typing glitch.
yak8998 - Sunday, February 1, 2004 - link
first the error:pg 9 -
The LDDQU instruction is one Intel is particularly proud of as it helps accelerate video encoding and it is implemented in the DivX 5.1.1 codec. More information on how it is used can be found in Intel’s developer documentation here.
No link?
===
"What's the power consumption like on these new bad boys?
Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended?? "
I'm going to guess a clean running ~350W or so should suffice for a regular system, but I'm not positive with these monster gfx cards out rite now...
"Any of you know what the cache size on the EE's will be?"
If your talking about the Northwood (the p4c's are still considered northwoods, no?), its 1mb I believe.
(still finishing the article. man i love these in-depth technical articles)
Tiorapatea - Sunday, February 1, 2004 - link
I agree, some info on power consumption please.Thanks for the article, by the way.
I guess we'll have to wait and see how Prescott ramps in speed versus 90nm A64.
AgaBooga - Sunday, February 1, 2004 - link
Much better than the P4's origional launch...All I want to know now is what AMD is going to do soon... They'll probably counteract Prescott with high clock speeds but when and by how much is what matters.
Any of you know what the cache size on the EE's will be?
Also, the final CPU's based on Northwood are kind of like a car with the ratio curves or whatever they're called, but basically after a point of revving, going any higher doesn't give you as much of an increase in speed as it would at a lower rpm increasing the same amount.
Cygni - Sunday, February 1, 2004 - link
AMD's roadmap shows a 4000+ Athlon64 by the end of the year... which is the same as Intel's. They are aware, im sure.Stlr22 - Sunday, February 1, 2004 - link
What's the power consumption like on these new bad boys?Is anything less than a quality 450watt PSU gonna be generally *NOT* recommended??
HammerFan - Sunday, February 1, 2004 - link
Things are gonna get hairy in '04 and '05!!! My take is that AMD nees to get their marketing up-to-spec or the high-clocked prescotts are gonna run the show.I have a question for Derek and Anand: What kind of temps does the prescott run at? what type of cooler does it have? (there's nothing there to support or refute claims that the prescott is one hot potato)