Intel's Pentium 4 E: Prescott Arrives with Luggage
by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST- Posted in
- CPUs
An Impatient Prescott: Scheduler Improvements
Prescott can’t keep any more operations in flight (active in the pipeline) than Northwood, but because of the longer pipeline Prescott must work even harder to make sure that it is filled.
We just finished discussing branch predictors and their importance in determining how deep of a pipeline you can have, but another contributor to the equation is a CPU’s scheduling windows.
Let’s say you’ve got a program that is 3 operations long:
1. D = B + 1
2. A = 3 + D
3. C = A + B
You’ll notice that the 2nd operation can’t execute until the first one is complete, as it depends on the outcome (D) of the first operation. The same is true for the 3rd operation, it can’t execute until it knows what the value of A is. Now let’s say our CPU has 3 ALUs, and in theory could execute three adds simultaneously, if we just had this stream of operations going through the pipeline, we would only be using 1/3 of our total execution power - not the best situation. If we just upgraded from a CPU with 1 ALU, we would be getting the same throughput as our older CPU – and no one wants to hear that.
Luckily, no program is 3 operations long (even print “Hello World” is on the order of 100 operations) so our 3 ALUs should be able to stay busy, right? There is a unit in all modern day CPUs whose job it is to keep execution units, like ALUs, as busy as possible – as much of the time as possible. This is the job of the scheduler.
The scheduler looks at a number of operations being sent to the CPU’s execution cores and attempts to extract the maximum amount of parallelism possible from the operations. It does so by placing pending operations as soon as they make it to the scheduling stage(s) of the pipeline into a buffer or scheduling window. The size of the window determines the amount of parallelism that can be extracted, for example if our CPU’s scheduling window were only 3 operations large then using the above code example we would still only use 1/3 of our ALUs. If we could look at more operations, we could potentially find code that didn’t depend on the values of A, B or D and execute that in parallel while we’re waiting for other operations to complete. Make sense?
Because Intel increased the size of the pipeline on Prescott by such a large amount, the scheduling windows had to be increased a bit. Unfortunately, present microarchitecture design techniques do not allow for very large scheduling windows to be used on high clock speed CPUs – so the improvements here were minimal.
Intel increased the size of the scheduling windows used to buffer operations going to the FP units to coincide with the increase in pipeline.
There is also parallelism that can be extracted out of load and store operations (getting data out of and into memory). Let’s say that you have the following:
A = 1 + 3
Store A at memory location X
…
…
Load A from memory location X
The store operation actually happens as two operations (further pipelining by splitting up a store into two operations): a store address operation (where the data is going) and a store data operation (what the data actually is). The problem here is that the scheduler may try to parallelize the store operations and the load operation without realizing that the two are dependent on one another. Once this is discovered, the load will not execute and a performance penalty will be paid because the CPU’s scheduler just wasted time getting a load ready to execute and then having to get rid of it. The load will eventually execute after the store operations have completed, but at a significant performance penalty.
If a situation like the one mentioned above does crop up, long pipeline designs will suffer greatly – meaning that Prescott wants this to happen even less than Northwood. In Prescott, Intel included a small, very accurate, predictor to predict whether a load operation is likely to require data from a soon-to-be-executed store and hold that load until the store has executed. Although the predictor isn’t perfect, it will reduce bubbles of no-execution in the pipeline – a killer to Prescott and all long pipelined architectures.
Don’t look at these enhancements to improve performance, but to help balance the lengthened pipeline. A lot of the improvements we’ll talk about may sound wonderful but you must keep in mind that at this point, Prescott needs these technologies in order to equal the performance of Northwood so don’t get too excited. It’s an uphill battle that must be fought.
104 Comments
View All Comments
INTC - Monday, February 2, 2004 - link
Ummmm yea, kinda reminds me of cooking an egg on an Athlon XP http://www.biggaybear.co.uk/Menu/Aegg/Aeggs.htmlcliffa3 - Monday, February 2, 2004 - link
something good to include on the mb compatibility article would be what boards would house the 2.8/533...i'm wondering myself if the E7205 chipset would...i have a p4g8x, and it would be a welcome upgrade with HT and all the other goodies if it oc's well.Stlr22 - Monday, February 2, 2004 - link
They didn't burn down, but the proc were running hot. Not to mention, these are the FIRST releases in the Prescott line. What's it gonna be like later on?....Just think, a P4 based computer that turns your living room into your very own Sauna!!....WHOOO-HOOO!!.....now that's what I call a bargain!
INTC - Monday, February 2, 2004 - link
The message is clear: Anandtech and all of the other review sites didn't burn down so I guess it's not a flame thrower.Prescott is not as fast as I had hoped but is definitely not the step backwards as some were rumoring it to be. I think a Prescott 2.8 @ 250 MHz FSB will be really nice to play with until I see what Intel announces at IDF in a few weeks.
Icewind - Monday, February 2, 2004 - link
The message is clear: Im buying an Athlon 64.Vanners - Sunday, February 1, 2004 - link
Did anyone catch the error in Pipelining: 101?if you halve the time for a stage in the pipeline and double the number of stages. Yes this means you can run at 2GHz instead of 1GHz but the reality is you're still taking 5ns to complete the pipe.
Look at it like a motorbike: You drop down a gear and rev harder; you make more noise but you are still doing the same speed.
The only reasons to drop down a gear are to break through your gears (i.e. slow down) or to rev significantly higher than the change in gear ratio in order to move faster (with more torque).
The trouble Intel has is that they drop down a gear then rev 6 months to a year later.
kamper - Sunday, February 1, 2004 - link
Just curious, Anand or Derek: what board did you use to get the 3.72 GHz oc? Obviously it wasn't the intel board used in the benches. I guess we'll hear all about this in the compatibility review though :)keep up the good work, that last point about smaller margins at higher clockspeeds (vs. Northwood) was cool. Let's just hope the pattern continues.
Stlr22 - Sunday, February 1, 2004 - link
Seems to me like people either got cought up in some of the hype and expected to much or some people expected to little and that history would repeat itself (Willamette vs Palomino)The fact that the Prescott fared much better in it's launch compared to the Willamette might be a hint to not underestimate it. Prescott isn't really looking bad now, and I think it will hit stride faster then the Willamette core did.
The next couple of years are gonna be really interesting.
Damn, ya just gotta love it!
ntrights - Sunday, February 1, 2004 - link
Great review!KF - Sunday, February 1, 2004 - link
I've grown to appreciate CRAMITPAL. If you read around the opinionated diatribes, he has some good stuff that people avoid saying for fear of retaliation. I suppose if I were in love with Intel, he would tick me off.But, it does look like Intel has created a CPU that should ramp up to speeds high enough to beat the A64 in 32bit mode, and that is all they needed to do.
Regardless of how much heat that is going to take, Intel must have some way in the works to handle it.
Looks like they might not charge an arm and leg for it, which is the biggest shock.