Intel's Pentium 4 E: Prescott Arrives with Luggage
by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST
Posted in: CPUs
An Impatient Prescott: Scheduler Improvements
Prescott can’t keep any more operations in flight (active in the pipeline) than Northwood can, but because its pipeline is longer, Prescott must work even harder to keep it filled.
We just finished discussing branch predictors and their importance in determining how deep a pipeline you can have, but another contributor to the equation is a CPU’s scheduling windows.
Let’s say you’ve got a program that is 3 operations long:
1. D = B + 1
2. A = 3 + D
3. C = A + B
You’ll notice that the 2nd operation can’t execute until the first one is complete, as it depends on the outcome (D) of the first operation. The same is true for the 3rd operation: it can’t execute until it knows the value of A. Now let’s say our CPU has 3 ALUs and could, in theory, execute three adds simultaneously. If this stream of operations were all that was going through the pipeline, we would only be using 1/3 of our total execution power, which is not the best situation. If we had just upgraded from a CPU with 1 ALU, we would be getting the same throughput as our older CPU, and no one wants to hear that.
Luckily, no program is 3 operations long (even print “Hello World” is on the order of 100 operations), so our 3 ALUs should be able to stay busy, right? All modern-day CPUs have a unit whose job is to keep execution units, like ALUs, as busy as possible for as much of the time as possible. That unit is the scheduler.
The scheduler looks at a number of operations being sent to the CPU’s execution cores and attempts to extract the maximum amount of parallelism possible from them. It does so by placing pending operations into a buffer, or scheduling window, as soon as they reach the scheduling stage(s) of the pipeline. The size of the window determines how much parallelism can be extracted. For example, if our CPU’s scheduling window were only 3 operations large, then with the above code example we would still only use 1/3 of our ALUs. If we could look at more operations, we could potentially find code that didn’t depend on the values of A, B or D and execute it in parallel while we wait for the other operations to complete. Make sense?
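To make the effect of window size concrete, here is a minimal toy simulation in Python. It is only a sketch under simplifying assumptions that are ours, not Intel's: every operation takes one cycle, each result name is written once, there are 3 ALUs, and the function name simulate and the extra operations T0 through T5 (which depend only on a made-up input X) are hypothetical. It models no real CPU.

```python
def simulate(program, window_size, num_alus=3):
    """Return the number of cycles a toy out-of-order core needs for `program`.

    program: list of (dest, sources) tuples in program order.
    Assumptions: every op takes one cycle, each dest is written only once,
    and any source that no op produces is available from the start.
    """
    written = {dest for dest, _ in program}
    ready = {src for _, srcs in program for src in srcs} - written
    pending = list(program)  # ops not yet placed in the scheduling window
    window = []
    cycles = 0
    while pending or window:
        # Refill the scheduling window in program order.
        while pending and len(window) < window_size:
            window.append(pending.pop(0))
        # Issue up to num_alus ops whose inputs are all ready.
        issued = [op for op in window if all(s in ready for s in op[1])][:num_alus]
        for op in issued:
            window.remove(op)
        # Results become visible to later ops at the end of the cycle.
        ready.update(dest for dest, _ in issued)
        cycles += 1
    return cycles


# The three dependent operations from the example above.
chain = [("D", ("B",)), ("A", ("D",)), ("C", ("A", "B"))]
# Six hypothetical operations that depend only on an input X.
independent = [(f"T{i}", ("X",)) for i in range(6)]
program = chain + independent

print(simulate(chain, window_size=3))    # 3 cycles: one ALU busy per cycle
print(simulate(program, window_size=1))  # 9 cycles: no parallelism at all
print(simulate(program, window_size=3))  # 4 cycles
print(simulate(program, window_size=9))  # 3 cycles: all three ALUs stay busy
```

With a 3-entry window the dependent chain blocks independent work from being seen early, so the 9 operations take 4 cycles. A window large enough to see past the chain gets the same work done in 3.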
Because Intel increased the length of the pipeline on Prescott by such a large amount, the scheduling windows had to be increased a bit. Unfortunately, present microarchitecture design techniques do not allow very large scheduling windows to be used on high clock speed CPUs, so the improvements here were minimal.
Intel increased the size of the scheduling windows used to buffer operations going to the FP units to coincide with the increase in pipeline length.
There is also parallelism that can be extracted out of load and store operations (getting data out of and into memory). Let’s say that you have the following:
A = 1 + 3
Store A at memory location X
…
…
Load A from memory location X
The store actually happens as two operations (further pipelining by splitting a store in two): a store address operation (where the data is going) and a store data operation (what the data actually is). The problem is that the scheduler may try to execute the store operations and the load in parallel without realizing that the load depends on the store. Once this is discovered, the load is cancelled, and the time the scheduler spent getting it ready to execute is wasted. The load will eventually execute after the store operations have completed, but only after a significant performance penalty has been paid.
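The hazard can be sketched in a few lines of Python. This is an illustrative model only: the function classify_load, the dictionary layout and the four result strings are our own hypothetical names, and real hardware tracks far more state than this.

```python
def classify_load(load_addr, older_stores):
    """Decide what a load may do given the stores ahead of it in program order.

    older_stores: list of dicts with keys "addr" and "data", oldest first.
    A value of None means the matching store address or store data
    operation has not executed yet. (Hypothetical layout for illustration.)
    """
    # Walk from the youngest older store back to the oldest.
    for store in reversed(older_stores):
        if store["addr"] is None:
            # Destination not known yet: running the load now is a gamble.
            return "unknown"
        if store["addr"] == load_addr:
            # The load depends on this store; it needs the store's data.
            return "forward" if store["data"] is not None else "wait"
    return "safe"  # no older store writes to this location


X = 0x100
print(classify_load(X, [{"addr": X, "data": 4}]))         # forward
print(classify_load(X, [{"addr": X, "data": None}]))      # wait
print(classify_load(X, [{"addr": None, "data": 4}]))      # unknown
print(classify_load(0x200, [{"addr": X, "data": 4}]))     # safe
```

The "unknown" case is the one described above: if the scheduler lets the load run anyway and the store turns out to write location X, the load has to be thrown away and redone.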
If a situation like the one above does crop up, long pipeline designs suffer greatly, meaning that Prescott wants this to happen even less than Northwood does. In Prescott, Intel included a small, very accurate predictor that guesses whether a load is likely to require data from a soon-to-be-executed store and, if so, holds that load until the store has executed. Although the predictor isn’t perfect, it will reduce bubbles of no-execution in the pipeline, a killer for Prescott and all long pipelined architectures.
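Intel has not detailed how this predictor is built, so the following Python sketch shows only one generic way such a predictor could work: a small table of saturating counters indexed by the load's instruction address. The class name, table size, threshold and indexing scheme are all assumptions made for illustration, not Prescott's actual design.

```python
class LoadDependencePredictor:
    """Toy predictor that remembers which loads collided with an older store.

    All parameters are hypothetical; this is not Prescott's real mechanism.
    """

    def __init__(self, entries=16, threshold=2, max_count=3):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.counters = [0] * entries  # one saturating counter per slot

    def _slot(self, load_pc):
        # Index the table with the low bits of the load's address.
        return load_pc % self.entries

    def should_hold(self, load_pc):
        """True if the load should wait until older stores have executed."""
        return self.counters[self._slot(load_pc)] >= self.threshold

    def update(self, load_pc, collided):
        """Train with the real outcome once the older stores have resolved."""
        i = self._slot(load_pc)
        if collided:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = LoadDependencePredictor()
load_pc = 0x40
print(predictor.should_hold(load_pc))  # False: nothing learned yet
predictor.update(load_pc, collided=True)
predictor.update(load_pc, collided=True)
print(predictor.should_hold(load_pc))  # True: this load is now held back
```

A load that keeps colliding with a store gets held, which costs a few cycles but avoids the much larger cost of cancelling and replaying it. Loads that never collide are left free to run early.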
Don’t look to these enhancements to improve performance; they are there to help balance the lengthened pipeline. A lot of the improvements we’ll talk about may sound wonderful, but keep in mind that at this point Prescott needs these technologies just to equal the performance of Northwood, so don’t get too excited. It’s an uphill battle that must be fought.
104 Comments
ianwhthse - Sunday, February 1, 2004
*sigh*
Well, now I know.
*goes to buy A64*
KristopherKubicki - Sunday, February 1, 2004
read the article...
Stlr22 - Sunday, February 1, 2004
31 stage pipeline?!.....lol..guess those "30 stage pipelne" rumors were true.
These processors aren't bad at all. They performed on the same level as the Northwood versions. They just aren't worth the "premium" price tag that they will carry for now.
Looks like there wont be a better time to grab a Northwwod, as I'm sure these puppies will keep dropping in price to make room for the Prescotts.
Thatguy97 - Wednesday, April 29, 2015
lol never even made to 4ghz man you guys did not give intel the crap it deserved