Intel's Pentium 4 E: Prescott Arrives with Luggage
by Anand Lal Shimpi & Derek Wilson on February 1, 2004 3:06 PM EST
Posted in: CPUs
An Impatient Prescott: Scheduler Improvements
Prescott can’t keep any more operations in flight (active in the pipeline) than Northwood, but because of the longer pipeline Prescott must work even harder to make sure that it is filled.
We just finished discussing branch predictors and their importance in determining how deep a pipeline can be, but another contributor to the equation is the size of a CPU’s scheduling windows.
Let’s say you’ve got a program that is 3 operations long:
1. D = B + 1
2. A = 3 + D
3. C = A + B
You’ll notice that the 2nd operation can’t execute until the first one is complete, as it depends on the outcome (D) of the first operation. The same is true for the 3rd operation: it can’t execute until it knows the value of A. Now let’s say our CPU has 3 ALUs and in theory could execute three adds simultaneously. If we just had this stream of operations going through the pipeline, we would only be using 1/3 of our total execution power, which is not the best situation. If we had just upgraded from a CPU with 1 ALU, we would be getting the same throughput as our older CPU, and no one wants to hear that.
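To put a number on that 1/3 figure, here is a minimal Python sketch of the three-operation program. It is a toy model, not a description of any real CPU: it assumes every add takes exactly one cycle and that B is available from the start. The names `program`, `finish_cycle` and `NUM_ALUS` are made up for illustration.

```python
# Toy model (our assumption): each add takes one cycle, and an operation
# can only start once every value it reads has been produced.
program = {
    "D": ["B"],        # 1. D = B + 1
    "A": ["D"],        # 2. A = 3 + D
    "C": ["A", "B"],   # 3. C = A + B
}

def finish_cycle(reg, program):
    """Earliest cycle at which `reg` is available.

    Registers the program never writes (like B) are inputs, ready at cycle 0.
    """
    if reg not in program:
        return 0
    return 1 + max((finish_cycle(src, program) for src in program[reg]), default=0)

NUM_ALUS = 3  # hypothetical machine width

# The longest dependency chain sets the minimum run time, however many ALUs exist.
cycles = max(finish_cycle(reg, program) for reg in program)
utilization = len(program) / (cycles * NUM_ALUS)

print(f"cycles needed: {cycles}")
print(f"ALU utilization: {utilization:.0%}")
```

Changing `NUM_ALUS` to 1 leaves `cycles` untouched, which is the point of the example: the dependency chain, not the number of ALUs, is the limit here.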
Luckily, real programs are far longer than 3 operations (even printing “Hello World” is on the order of 100 operations), so our 3 ALUs should be able to stay busy, right? All modern out-of-order CPUs have a unit whose job is to keep execution units, like ALUs, busy as much of the time as possible. That unit is the scheduler.
The scheduler looks at a number of operations being sent to the CPU’s execution cores and attempts to extract the maximum amount of parallelism possible from them. It does so by placing pending operations into a buffer, or scheduling window, as soon as they reach the scheduling stage(s) of the pipeline. The size of the window limits the amount of parallelism that can be extracted. For example, if our CPU’s scheduling window were only 3 operations large, then with the above code example we would still only use 1/3 of our ALUs. If we could look at more operations, we could potentially find code that didn’t depend on the pending results (A, C or D) and execute it in parallel while we’re waiting for other operations to complete. Make sense?
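The effect of window size can be sketched with a small simulation. This is a deliberately simplified model and not how Prescott's schedulers are actually built: it assumes one-cycle operations, a window that refills in program order every cycle, and a program made of three independent chains of three dependent adds each. The names `DEPS` and `simulate` and their parameters are hypothetical.

```python
# Each entry lists the earlier operations (by index) that an operation waits on.
# Three independent chains of three dependent adds, laid out one after another.
DEPS = [
    set(), {0}, {1},   # chain 1
    set(), {3}, {4},   # chain 2
    set(), {6}, {7},   # chain 3
]

def simulate(deps, window_size, num_alus):
    """Count cycles to run `deps` with a scheduling window of `window_size` entries."""
    window = []        # operations the scheduler can currently see
    completed = set()  # operations whose results are available
    next_fetch = 0
    cycles = 0
    while len(completed) < len(deps):
        # Refill the window in program order.
        while len(window) < window_size and next_fetch < len(deps):
            window.append(next_fetch)
            next_fetch += 1
        # Issue up to num_alus operations whose inputs are all available.
        ready = [op for op in window if deps[op] <= completed][:num_alus]
        for op in ready:
            window.remove(op)
        completed.update(ready)  # results become visible in the next cycle
        cycles += 1
    return cycles

for size in (1, 3, 9):
    print(f"window of {size}: {simulate(DEPS, size, num_alus=3)} cycles")
# window of 1: 9 cycles
# window of 3: 6 cycles
# window of 9: 3 cycles
```

With the same 3 ALUs in every run, only the window that can see all three chains at once keeps every ALU busy on every cycle.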
Because Intel increased the length of the pipeline on Prescott by such a large amount, the scheduling windows had to be increased a bit. Unfortunately, present microarchitecture design techniques do not allow for very large scheduling windows to be used on high clock speed CPUs, so the improvements here were minimal.
Intel increased the size of the scheduling windows used to buffer operations going to the FP units to coincide with the increase in pipeline length.
There is also parallelism that can be extracted out of load and store operations (getting data out of and into memory). Let’s say that you have the following:
A = 1 + 3
Store A at memory location X
…
…
Load A from memory location X
The store actually executes as two operations (splitting it up allows further pipelining): a store address operation (where the data is going) and a store data operation (what the data actually is). The problem here is that the scheduler may try to execute the load in parallel with the store operations without realizing that the load depends on them. Once this is discovered, the load has to be thrown away, and the time the scheduler spent getting it ready to execute is wasted. The load will eventually execute after the store operations have completed, but at a significant performance penalty.
If a situation like the one mentioned above does crop up, long pipeline designs will suffer greatly, meaning that Prescott wants this to happen even less than Northwood does. In Prescott, Intel included a small, very accurate predictor that guesses whether a load operation is likely to require data from a soon-to-be-executed store, and holds that load until the store has executed. Although the predictor isn’t perfect, it will reduce bubbles of no-execution in the pipeline, which are a killer for Prescott and all long pipelined architectures.
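Intel has not described the internals of this predictor here, so the following Python sketch is purely illustrative. It assumes a small table of 2-bit saturating counters indexed by the load's instruction address, a common textbook approach that may differ from what Prescott does. The class `StoreLoadPredictor`, the function `count_replays`, the table size and the trace are all invented for the example.

```python
class StoreLoadPredictor:
    """Hypothetical predictor: one 2-bit saturating counter per table entry."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [0] * entries

    def predict_conflict(self, load_pc):
        # A counter of 2 or 3 means "this load usually needs an earlier store".
        return self.counters[load_pc % self.entries] >= 2

    def train(self, load_pc, conflicted):
        i = load_pc % self.entries
        if conflicted:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_replays(trace, predictor=None):
    """Count loads that ran too early and had to be thrown away and redone."""
    replays = 0
    for load_pc, conflicted in trace:
        hold = predictor.predict_conflict(load_pc) if predictor else False
        if conflicted and not hold:
            replays += 1  # load issued ahead of the store it depends on
        if predictor:
            predictor.train(load_pc, conflicted)
    return replays


# Made-up trace: the load at 0x40 always depends on a pending store,
# the load at 0x44 never does. Ten loop iterations.
TRACE = [(0x40, True), (0x44, False)] * 10

print("replays without predictor:", count_replays(TRACE))
print("replays with predictor:   ", count_replays(TRACE, StoreLoadPredictor()))
# replays without predictor: 10
# replays with predictor:    2
```

In this sketch the predictor still misses the first two conflicts while its counter warms up, which matches the article's caveat that the predictor isn't perfect. The sketch also ignores the cost of holding a load that didn't need to wait.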
Don’t look to these enhancements to improve performance; they are there to help balance the lengthened pipeline. A lot of the improvements we’ll talk about may sound wonderful, but keep in mind that at this point Prescott needs these technologies just to equal the performance of Northwood, so don’t get too excited. It’s an uphill battle that must be fought.
104 Comments
sprockkets - Monday, February 2, 2004 - link
Hmmm... on Intel's website on the new processor news: "Thermal Monitoring: Allows motherboards to be cost-effectively designed to expected application power usages rather than theoretical maximums."
Not sure what it means. I'm thinking clock throttling so that if your particular chip is hotter than it should be it will run on under engineered motherboards/coolers.
This chip dissipates around the same heat as Northwoods clock for clock! And of course, Intel style is wait 6-12, then the new stuff will actually be good. Still, is it really that important to increase performance so much that heat becomes an issue? I.E., will Dell be able to make the cooling whisper quiet? They can with the processor sitting at 80-90c, but now that with normal cooling it's almost there, now what will they do? Why can't we just have new processors that run so cool that we can just use heatsinks without fans? Oh well.
Novaoblivion - Monday, February 2, 2004 - link
Great article :) I found it very interesting I dont think I'll be buying a prescott till they hit about 4Ghz. My 2.4C is nice and fast for now.
CRAMITPAL - Monday, February 2, 2004 - link
http://www.theinquirer.net/?article=13927
http://www.theinquirer.net/?article=13947
johnsonx - Monday, February 2, 2004 - link
To Vanners, #38:
"if you halve the time for a stage in the pipeline and double the number of stages. Yes this means you can run at 2GHz instead of 1GHz but the reality is you're still taking 5ns to complete the pipe."
Yes and no... In the example, you're right that a single instruction takes the same 5ns to complete. But you're not just executing a single instruction... rather, thousands to millions! The 10 stage pipe has twice as many instructions in flight as the 5 stage pipe. Therefore in the example, you get one result out of the 5-stage/1Ghz cpu every 1ns, but TWO results out of the 10-stage/2Ghz cpu in the same 1ns... twice as many.
What I find interesting is that as pipelines get longer and longer, we might have to start talking about Instruction Latency: the number of clocks and ns between the time an instruction goes in and when the result comes out. It'll never be anything a human could notice directly, but it might come into play in high-performance realtime apps that deal with input from the outside world, and have to produce synchronized output. Any CPU calculates somewhat "back-in-time" as instructions fly down the pipe... right now, a Prescott calculates about twice as far behind 'reality' as an A64 does. I don't know if there is any realworld application where this really could make a difference, or if there ever will be, but it's interesting to ponder, particularly if the pipeline lengths of Intel vs. AMD continue to diverge.
cliffa3 - Monday, February 2, 2004 - link
i don't see how a 4+GHz prescott will match up with intel's new pico BTX form factor...with that much heat (using air cooling), you need to keep a safe zone around the proc unless you like your RAM DDR+BBQ.
I'd have to say that a lot of enthusiasts are younger and live in limited space conditions...might work well for people up north who don't want to run the heater, but as for me in texas, i have all the cool air pumping in to my bedroom and it still takes a lot to keep it cool. Can you imagine a university or corporation having a room full of those?..if they think about that, then it's no bueno for DELL and others as well.
I'd also have to agree with the others about the heat/power being a major part of the article that was left out...otherwise a tremendous read, thanks for all the effort that goes into these.
tfranzese - Monday, February 2, 2004 - link
But - I need to add - the correction was needed and is welcome. Not trying to pick a bone with the editors.
tfranzese - Monday, February 2, 2004 - link
#55, you read what I read. I'll vouch for you.
Icewind - Monday, February 2, 2004 - link
#55
Better go back to sleep me thinks :)
Spearhawk - Monday, February 2, 2004 - link
Is it just me (who was extremely tired yesterday) or has the 101 on pipeline part changed since the article was put up?
I seem to remember reading something about how a 5 staged CPU at 1 Ghz should be exactly as fast as a 2 GHz CPU with 10 stages (all else being equal of course) and that the secret of getting any profit out of going to more stages was to make sure that it couldn't only scale to 2 Ghz but to 3 Ghz or more.
Icewind - Monday, February 2, 2004 - link
I think shuttle owners are SOL with prescott.