Intel's Pentium 4 E: Prescott Arrives with Luggage
by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST
Posted in: CPUs
An Impatient Prescott: Scheduler Improvements
Prescott can’t keep any more operations in flight (active in the pipeline) than Northwood, but because of the longer pipeline Prescott must work even harder to make sure that it is filled.
We just finished discussing branch predictors and their importance in determining how deep a pipeline you can have, but another contributor to the equation is the size of a CPU’s scheduling windows.
Let’s say you’ve got a program that is 3 operations long:
1. D = B + 1
2. A = 3 + D
3. C = A + B
You’ll notice that the 2nd operation can’t execute until the first one is complete, as it depends on the outcome (D) of the first operation. The same is true for the 3rd operation: it can’t execute until it knows the value of A. Now let’s say our CPU has 3 ALUs and in theory could execute three adds simultaneously. If we just had this stream of operations going through the pipeline, we would only be using 1/3 of our total execution power, which is not the best situation. If we had just upgraded from a CPU with 1 ALU, we would be getting the same throughput as our older CPU, and no one wants to hear that.
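To make the 1/3 figure concrete, here is a minimal Python sketch of the three-operation program. It is a toy model, not a description of Prescott: it assumes every add takes one cycle, that a result can be used the cycle after it is produced, and that only true data dependencies matter. The names `PROGRAM`, `issue_cycles` and `alu_utilization` are made up for illustration.

```python
# Each operation is (destination, registers it reads). Constants are left out.
PROGRAM = [
    ("D", ("B",)),      # 1. D = B + 1
    ("A", ("D",)),      # 2. A = 3 + D
    ("C", ("A", "B")),  # 3. C = A + B
]


def issue_cycles(program):
    """Earliest cycle each operation can execute (assumed one-cycle adds).

    Only data dependencies are modelled; the number of ALUs is ignored here.
    """
    ready_at = {}  # register -> cycle in which its value was produced
    cycles = []
    for dest, sources in program:
        # Inputs never written by the program (like B) count as ready at cycle 0.
        start = max(ready_at.get(src, 0) for src in sources) + 1
        ready_at[dest] = start
        cycles.append(start)
    return cycles


def alu_utilization(program, num_alus):
    """Fraction of ALU slots doing useful work over the whole run."""
    total_cycles = max(issue_cycles(program))
    return len(program) / (num_alus * total_cycles)


print(issue_cycles(PROGRAM))        # [1, 2, 3]
print(alu_utilization(PROGRAM, 3))  # one third of the ALU slots are used
```

The chain takes three cycles whether the CPU has one ALU or three, which is exactly why the extra ALUs buy nothing for this stream of operations.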
Luckily, no program is 3 operations long (even print “Hello World” is on the order of 100 operations) so our 3 ALUs should be able to stay busy, right? There is a unit in all modern-day CPUs whose job is to keep execution units, like ALUs, as busy as possible, as much of the time as possible. This is the job of the scheduler.
The scheduler looks at a number of operations being sent to the CPU’s execution cores and attempts to extract the maximum amount of parallelism possible from them. It does so by placing pending operations into a buffer, or scheduling window, as soon as they reach the scheduling stage(s) of the pipeline. The size of the window determines the amount of parallelism that can be extracted. For example, if our CPU’s scheduling window were only 3 operations large, then with the above code example we would still only use 1/3 of our ALUs. If we could look at more operations, we could potentially find code that didn’t depend on the values of A, B or D and execute it in parallel while we’re waiting for other operations to complete. Make sense?
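The effect of window size can be sketched with a toy simulator. This is only an illustration under loud assumptions, not how Prescott's schedulers actually work: every operation takes one cycle, results are usable the following cycle, the window always holds the oldest not-yet-issued operations in program order, and each register is written exactly once. The helper names `chain` and `simulate` and the register names are hypothetical.

```python
def chain(name, length):
    """Build a dependency chain: name1 = x + 1, name2 = name1 + 1, ..."""
    ops = [(f"{name}1", ("x",))]
    for i in range(2, length + 1):
        ops.append((f"{name}{i}", (f"{name}{i - 1}",)))
    return ops


def simulate(program, window_size, num_alus=3):
    """Count cycles needed to run `program` with a given scheduling window."""
    written = {dest for dest, _ in program}  # registers the program produces
    ready = set()                            # registers already produced
    pending = list(program)                  # not yet issued, in program order
    cycles = 0
    while pending:
        cycles += 1
        window = pending[:window_size]       # the scheduler only sees these
        issued = []
        for op in window:
            _, sources = op
            # A source is usable if it is an initial input or already produced.
            sources_ready = all(s in ready or s not in written for s in sources)
            if sources_ready and len(issued) < num_alus:
                issued.append(op)
        for op in issued:
            pending.remove(op)
        # Results only become visible to operations issued in later cycles.
        ready.update(dest for dest, _ in issued)
    return cycles


# Three independent chains of three dependent adds each.
program = chain("a", 3) + chain("b", 3) + chain("c", 3)
for size in (1, 3, 9):
    print(size, simulate(program, size))  # 1 -> 9 cycles, 3 -> 6, 9 -> 3
```

With a window of 3 the scheduler spends most cycles staring at operations that are stuck behind their predecessors. Only when the window is large enough to see all three chains at once do all 3 ALUs stay busy every cycle.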
Because Intel lengthened Prescott’s pipeline by such a large amount, the scheduling windows had to be enlarged a bit. Unfortunately, present microarchitecture design techniques do not allow very large scheduling windows on high clock speed CPUs, so the improvements here were minimal.
Intel increased the size of the scheduling windows used to buffer operations going to the FP units to coincide with the increase in pipeline length.
There is also parallelism that can be extracted out of load and store operations (getting data out of and into memory). Let’s say that you have the following:
A = 1 + 3
Store A at memory location X
…
…
Load A from memory location X
The store actually executes as two operations (further pipelining by splitting the store in two): a store address operation (where the data is going) and a store data operation (what the data actually is). The problem here is that the scheduler may try to parallelize the store operations and the load operation without realizing that the load depends on the store. Once this is discovered, the load is cancelled and a performance penalty is paid, because the CPU’s scheduler just wasted time getting a load ready to execute only to get rid of it. The load will eventually execute after the store operations have completed, but at a significant performance cost.
If a situation like the one above does crop up, long-pipeline designs will suffer greatly, meaning that Prescott wants this to happen even less than Northwood does. In Prescott, Intel included a small, very accurate predictor that guesses whether a load is likely to require data from a soon-to-be-executed store and, if so, holds that load until the store has executed. Although the predictor isn’t perfect, it will reduce bubbles of no-execution in the pipeline, which are a killer for Prescott and all long-pipelined architectures.
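Intel has not described the predictor's internals here, so the following Python sketch is purely an assumption about how such a mechanism could behave, not Prescott's actual design. It keeps a small saturating counter per load (keyed by a hypothetical instruction address) and holds a load once it has collided with an older, not-yet-executed store often enough. The class name, the threshold and the counter limit are all invented for illustration.

```python
class LoadStorePredictor:
    """Toy predictor: remembers which loads have collided with pending stores."""

    def __init__(self, threshold=2, max_count=3):
        self.counters = {}          # load address -> saturating counter
        self.threshold = threshold  # hold the load at or above this count
        self.max_count = max_count  # counter saturates here

    def should_hold(self, load_pc):
        """Predict whether this load should wait for older stores to execute."""
        return self.counters.get(load_pc, 0) >= self.threshold

    def update(self, load_pc, collided):
        """Train on the outcome: `collided` is True if the load really needed
        data from an older store that had not executed yet."""
        count = self.counters.get(load_pc, 0)
        if collided:
            count = min(count + 1, self.max_count)
        else:
            count = max(count - 1, 0)
        self.counters[load_pc] = count


predictor = LoadStorePredictor()
decisions = []
for _ in range(4):
    decisions.append(predictor.should_hold(0x40))
    predictor.update(0x40, collided=True)
print(decisions)  # [False, False, True, True]
```

In this toy model the first two collisions still cost the full penalty. After that the load is held until the store has executed, so the scheduler stops wasting work on a load it would have to throw away.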
Don’t look to these enhancements to improve performance; they are there to help balance the lengthened pipeline. A lot of the improvements we’ll talk about may sound wonderful, but keep in mind that at this point Prescott needs these technologies just to equal the performance of Northwood, so don’t get too excited. It’s an uphill battle that must be fought.
104 Comments
Jeff7181 - Sunday, February 1, 2004
I'm going to go out on a limb here and say 2004 is the year of the Athlon-64 and Intel will take a back seat this year unless their new socket will help increase clock speeds. When AMD makes the transition to 90nm I think you'll see a jump in clock speed from them too... and I'm willing to bet their current 130nm processors will scale to 2.6 or 2.8 GHz if they want to put the effort into it before switching to 90nm.
Intel better hope people adopt SSE3 in favor of AMD-64 otherwise they're going to lose the majority of the benchmark tests.
On second thought... the real question is how high will Prescott scale... will we really see 4.0 GHz by the end of the year? Will performance scale as well as it does with the Athlon-64?
Right now, looking at the Prescott, the best I can say for it is "huh, 31 stages in the pipeline and they didn't lose too much performance, neat."
Barkuti - Sunday, February 1, 2004
Check out the article at xbitlabs: http://www.xbitlabs.com/articles/cpu/display/presc...
Less technical but with a wider set of tests.
Stlr22 - Sunday, February 1, 2004
;-)
Stlr22 - Sunday, February 1, 2004
((((((((((((((CRAMITPAL))))))))))))))))
Listen, I just want you to know that everything will be alright. Really, life isn't all that bad buddy. It's not good to keep so much hate inside. It's very unhealthy. We are all family here at the Anandtech forums and we care about you. If you ever need to sit down and talk, I'm all ears pal. So that your brother doesn't feel left out, here's a hug for him as well.......
(((((((((((((AMDjihad)))))))))))))
KF - Sunday, February 1, 2004
Yeah, the Inquirer was right about 30 stages. Maybe I should start reading it! However I did read the one where the news linked to an article purporting that an Inquirer reporter had bumped into a person who had overheard an Intel executive say Prescott was 64 bit. Maybe Derek and Anand didn't have the space to squeeze that tiny detail into the review.
I saw a paper on the Intel site a while ago, seemingly intended for some professional journal, the premise of which was that it is ALWAYS preferable to make the pipeline longer, no matter how long, while using techniques to reduce the penalties. Like, 100 stages would be a good thing. Right then I knew what one team at Intel was up to. The fact that they didn't explain any new penalty reduction techniques only made it all the more sure what Intel had in the works (otherwise why write the paper?), and that they had the techniques worked out, but still under wraps.
ianwhthse - Sunday, February 1, 2004
Err.. *Cramitpal
Sorry about that. My mind is wandering.
ianwhthse - Sunday, February 1, 2004
Did we actually just get 26 good posts in before crumpet showed up?
FiberOptik - Sunday, February 1, 2004
I like the part about the new shift/rotate unit on the CPU. Does this mean that Prescott will be noticeably faster for the RC5 project? Athlons usually mop the floor with whatever the Northwood can pump out.
eBauer - Sunday, February 1, 2004
"Botmatch has bots (AI) playing, shooting, running, etc. (deathmatch) while Flyby does not. The number that you should be most interested in is the Botmatch scores."
No, I am talking about the botmatch scores from previous articles. Well aware of the difference between flyby and botmatch. http://www.anandtech.com/cpu/showdoc.html?i=1946&a... In that article, all CPUs had about 10 more fps than the CPUs in the Prescott article.
AnonymouseUser - Sunday, February 1, 2004
"I am curious as to why the UT2k3 botmatch scores dropped on all CPUs... Different map?"
Botmatch has bots (AI) playing, shooting, running, etc. (deathmatch) while Flyby does not. The number that you should be most interested in is the Botmatch scores.