Intel Celeron D: New, Improved & Exceeds Expectations
by Derek Wilson on June 24, 2004 3:01 AM EST
Posted in CPUs
Under The Hood of Celeron D
For an in-depth look at what's different with the new Celeron, the first 11 or so pages of our Pentium 4 E (Prescott) launch article do an excellent job of covering the bases. For a quick summary, here's a look at the major changes inside the Prescott core:
- 90nm Strained Silicon Process - more, faster transistors in less space
- 31 Pipeline Stages - for clock speed ramping
- Improved Branch Predictor - helps avoid pipeline stalls
- Improved Scheduler - helps avoid doing unnecessary work
- Improved Execution Core - added integer multiply and fast shift to ALU
- Larger, Slower Caches - higher latency caches for speed and size scaling
- SSE3 - 13 new instructions
Even with the ominous 31-stage pipeline and higher latency caches, we get better performance with the new Celeron D. So, how does all this stack up to make Prescott a better Celeron than Northwood? Well, let's take it step by step.
First of all, the 16KB L1 cache of Prescott has a significant impact on the Celeron. Northwood-based Celerons have only 8KB of L1 cache. With 8KB more of the on-die data stored "closer" (in terms of latency) to the processor, more accesses will be served quickly from L1, even though cache latency on the Celeron D is the same as on the Pentium 4 E and therefore much higher than Northwood's. Getting data back to the core quickly is critical: even though the Celeron D has a larger L2 cache than its predecessor, the amount of on-die memory is still small, and cache misses will occur more often than on the Pentium 4.
When dealing with a processor short on cache and prone to very painful pipeline stalls, improving the average cache hit latency can really help to keep extra stalls from happening (a fast L2 hit will come back in about 25 cycles on Prescott), and can help to refill the pipeline once it has stalled (as more data will be able to get back into the pipeline faster).
This 8KB of extra L1 cache is a much smaller portion of the Pentium 4's total cache size. Since the Pentium 4 E has fewer cache misses than the Celeron D (it has 4 times the L2 cache), improvements to the L1 cache size don't have as much opportunity to shine.
Speaking of L2, the Celeron D has received an increase from 128KB in the current Celeron to 256KB. Even though this is still a quarter of the (still insufficient) 1MB cache the Pentium 4 E has, we aren't going to see the same type of performance drop we saw when moving from the Northwood Pentium 4 to Celeron (which also had a quarter of its big brother's cache). The reason is that the number of cache hits increases rapidly as cache size grows at first, and then reaches a point of diminishing returns. The curve is similar to a logarithmic curve (benefits increase rapidly as cache size increases at first, but then level off quickly).
What it comes down to is that doubling a small cache (say, going from 128KB to 256KB) will have a much higher impact on performance (because the number of cache hits is significantly increased) than doubling a larger cache (like going from 512KB to 1MB). In other words, the P4 E gets less benefit from its doubled L2 cache than the Celeron D.
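To make the diminishing-returns argument concrete, here is a rough sketch of a toy cache model. It assumes that miss rate falls off as a power law of cache size; the exponent and the 20% starting miss rate are illustrative assumptions of ours, not measurements from any Celeron or Pentium 4, and the function names are hypothetical.

```python
# Toy model only: assumes miss rate falls as a power law of cache size,
#   miss_rate = BASE_MISS * (size / BASE_KB) ** -ALPHA
# All constants are illustrative assumptions, not measured CPU data.
BASE_KB = 128      # reference cache size (hypothetical anchor point)
BASE_MISS = 0.20   # assumed miss rate at the reference size
ALPHA = 0.5        # assumed exponent ("square root" rule of thumb)


def miss_rate(size_kb: float) -> float:
    """Assumed miss rate for a cache of the given size in KB."""
    return BASE_MISS * (size_kb / BASE_KB) ** -ALPHA


def hit_rate_gain(small_kb: float, large_kb: float) -> float:
    """Absolute increase in hit rate when growing the cache."""
    return miss_rate(small_kb) - miss_rate(large_kb)


if __name__ == "__main__":
    # Compare doubling a small cache with doubling a large one.
    for small, large in [(128, 256), (512, 1024)]:
        gain = hit_rate_gain(small, large)
        print(f"{small}KB -> {large}KB: hit rate up by {gain:.1%}")
```

Under these assumed numbers, the 128KB to 256KB step adds twice as many hits as the 512KB to 1MB step. The exact ratio depends entirely on the assumed curve, but any curve that flattens out with size gives the same qualitative picture.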
While we're on the subject of caches and memory, the 533MHz frontside bus effectively gets data from memory to the processor faster in case of a cache miss. This is very important in the low-cache environment of the Celeron world. Unfortunately, we couldn't increase our multiplier and run our 2.8 GHz Celeron 335 at 28x100 to see just what kind of impact bus speed has on the new processor.
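For reference, the bus arithmetic behind that comparison can be sketched as follows. This assumes the usual Pentium 4 style frontside bus (quad-pumped, 64 bits wide) and gives peak theoretical figures only; the function names are ours.

```python
# Sketch: peak frontside bus bandwidth and core clock from multiplier.
# Assumes a quad-pumped, 64-bit (8-byte) bus; real throughput is lower.
TRANSFERS_PER_CLOCK = 4   # quad-pumped bus
BUS_WIDTH_BYTES = 8       # 64-bit data path


def fsb_bandwidth_mb_s(base_clock_mhz: float) -> float:
    """Peak bandwidth in MB/s for a given base bus clock in MHz."""
    return base_clock_mhz * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES


def core_clock_mhz(multiplier: int, base_clock_mhz: float) -> float:
    """Core clock in MHz for a multiplier and base bus clock."""
    return multiplier * base_clock_mhz


if __name__ == "__main__":
    for label, base in [("400MHz FSB", 100.0), ("533MHz FSB", 400.0 / 3)]:
        print(f"{label}: {fsb_bandwidth_mb_s(base):.0f} MB/s peak")
    # The same 2.8 GHz core clock reached two different ways.
    print(f"21 x 133.33 = {core_clock_mhz(21, 400.0 / 3):.0f} MHz")
    print(f"28 x 100    = {core_clock_mhz(28, 100.0):.0f} MHz")
```

Holding the core clock at 2.8 GHz while dropping the bus from 133MHz to 100MHz is what would isolate the effect of bus speed, which is why a 28x multiplier would have been needed.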
The enhancements Intel made to branch prediction and scheduling round out the factors that help make Prescott an excellent Celeron core. Since we're working with a small L2 cache, it is especially important to work with good data and avoid stalls for reasons other than cache misses. Northwood is at a disadvantage to Prescott here. Better branch prediction will help avoid filling the cache with data from a mispredicted branch as well as aid in averting unnecessary bubbles in the pipeline for the same reason. Better scheduling means more efficient use of the data available to the processor as well. Northwood is stuck on these two counts. Adding an integer multiply and fast shift/rotate to Prescott also helped the Celeron D maintain a high level of efficiency, but this really shouldn't have any greater impact on Celeron D than on Pentium 4.
It all comes down to being resilient and efficient. Northwood is very dependent on its L2 cache size. The enhancements Intel made to Prescott in order to avoid the large negative impact of adding so many pipeline stages really benefit the processor when it is starved for data. Prescott has to be more careful not to stall just to keep up with the current Pentium 4 line. As a result, the Celeron flavor can deal with tighter constraints on L2 cache size, which helps even more when paired with a larger cache than the Northwood-derived version.
54 Comments
Minot - Thursday, June 24, 2004 - link
Can we get a comparison of a P4 2.4A (Prescott, 1MB L2 cache, 533 MHz FSB) compared to these new Celeron D processors?
Pumpkinierre - Thursday, June 24, 2004 - link
Yeah, there's something more to this than meets the eye. I dont really follow your cache arguments, Derek (and I'm known not to like caches when they are irrelevant). To me what applied to the P4E applies to the celeron D. Its a pity you didnt throw in a 533MHz 2.8E in your benchmarks. I predicted the Prescott celeron would be a good buy but more on the basis of less heat and better o'clocking. The only conclusion I can come of all this, is the Prescott core is better than we think but the cache structure is the problem. Else they've changed something in the pipeline architecture of these celeron Ds which may have ramifications for later stepping P4Es.
TrogdorJW - Thursday, June 24, 2004 - link
#39 - Oh. Dang. Oops. Still, I think I probably would have chuckled more than anything. Who here hasn't made a major mistake at some point in their life? The only problem is that with Internet "publishing", your mistake can be put on the web in minutes rather than days.
Wonder how many "Flame AnandTech" threads have started up on other hardware forums about the original article?
TrogdorJW - Thursday, June 24, 2004 - link
Just out of curiosity, what's the actual transistor count and die size of the Celeron D?
With the trimmed down L2 cache, that cuts out about 40 million transistors. If they removed dormant 64-bit stuff from the Prescott core as well (or some other unnecessary additions), we're down to almost the same transistor count as Northwood, except with a 90 nm process.
Even with 75 million transistors, if it still uses 8 layers like Prescott (and not 6 like Northwood), that would put the die size at less than 70 mm2 by my calculations. Yowza!
Picture this: 70 mm2 die size CPUs on 300 mm wafers. That gives an absolute maximum of 1009 CPUs per wafer, minus those that are on the outer edge (i.e. partial cores). Even with conservative yields of 60%, we're talking about roughly 600 CPUs per wafer. No wonder they're so cheap.
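The wafer arithmetic in the comment above can be sketched as follows. This is a simple area-ratio estimate, assuming a perfectly circular 300 mm wafer and ignoring edge loss and scribe lines; the 70 mm2 die size and 60% yield are the commenter's guesses, not confirmed Intel figures, and the function names are ours.

```python
import math


def max_dies_by_area(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Upper bound on dies per wafer: wafer area divided by die area.

    Ignores partial dies at the edge, so the real count is lower.
    """
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2) ** 2
    return math.floor(wafer_area_mm2 / die_area_mm2)


def good_dies(total_dies: int, yield_fraction: float) -> int:
    """Number of working dies at an assumed yield."""
    return math.floor(total_dies * yield_fraction)


if __name__ == "__main__":
    total = max_dies_by_area(300, 70)   # commenter's assumed die size
    print(f"Upper bound: {total} dies per wafer")
    print(f"At 60% yield: {good_dies(total, 0.60)} good dies")
```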
DerekWilson - Thursday, June 24, 2004 - link
Unfortunately, TrogdorJW, most of the premise of the original publishing was based around the assumption that the Celeron D was able to outperform its predecessor in spite of having an equal sized cache.
It was a very large error, and certainly worthy of the outrage people have voiced.
Thanks for sticking up for us though. And please be assured that we will be much more careful. Again, we apologize for the error.
TrogdorJW - Thursday, June 24, 2004 - link
I think some overclocking results are definitely in order, though. Take the Celeron D 325 2.53 GHz part and overclock that to a 166 MHz bus and you get a 3.2 GHz (3.167 GHz) part. It would be interesting to see how that compares to the P4 3.2C and 3.2E - Sure, it will still be (a bit?) slower, but at less than half the cost!
I'm guessing that a 200 MHz bus is unreachable, as that would give you 3.8 GHz. Then again, from the 915/925 roundup, it seems that 3.8 or 3.9 GHz was reached with many of the Prescott CPUs. Damn... $80 for a CPU that might actually get close to 3.8 GHz!? Either I'm dreaming - entirely possible - or we have a return of the good old 300A overclocking days! Pray for the latter!
So, seriously, let's have the overclocking results, and compare that to regular P4, Athlon XP, and Athlon 64 results.
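The overclocking numbers in the comment above follow from a fixed multiplier and a faster bus. A minimal sketch, assuming the Celeron D 325's multiplier is locked at its stock value (derived here from 2.53 GHz on a 133 MHz bus); whether a given chip is stable at these speeds is a separate question, and the function names are ours.

```python
def locked_multiplier(stock_core_mhz: float, stock_bus_mhz: float) -> int:
    """Multiplier implied by the stock core clock and base bus clock."""
    return round(stock_core_mhz / stock_bus_mhz)


def overclocked_core_mhz(multiplier: int, bus_mhz: float) -> float:
    """Core clock reached by raising only the bus clock."""
    return multiplier * bus_mhz


if __name__ == "__main__":
    mult = locked_multiplier(2533, 400 / 3)   # 2.53 GHz on a 133 MHz bus
    print(f"Multiplier: {mult}x")
    for bus in (500 / 3, 200.0):              # 166 MHz and 200 MHz buses
        core = overclocked_core_mhz(mult, bus)
        print(f"{bus:.0f} MHz bus -> {core / 1000:.3f} GHz")
```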
TrogdorJW - Thursday, June 24, 2004 - link
Mino, considering that you can't even write a single sentence without spelling and/or grammar errors, I would think that calling the article "full of errors" is rather like the pot calling the kettle black. "I didn't intend to make me looka 'smart', nor is my opinion I am." Way to make yourself look even dumber! Granted, I'm looking at the corrected version of the article, but even if I had read it with the incorrect L2 cache size, it's not that big of a deal. (Unless the original article had statements along the lines of, "even with the same size cache the new Celeron D outperforms... blah blah blah..."?)
mino - Thursday, June 24, 2004 - link
#32 actually U are WRONG, fastest BUDGET proc from AMD is AthlonXP 2800+ which compared to Celeron2.8(here in SK called "Zelenina"-> means "vegetable") is like horse to ponny;)
However Cel. D is welcome improvement from Cel. based on Willys and Northwds. It is however move from near unusability(but like heater they were gut:) to low usability. Improved computing and heating performance is a good sign, especially ina winret nights when tey turn off heating ussually in my work :).
Xaazier - Thursday, June 24, 2004 - link
intel must be making these things cheap, 90nm process and only 256kb l2
also the number of celerons sold in cheap emachines and dells is high right?
mino - Thursday, June 24, 2004 - link
#31 Eh, sorry, by "Anand" i meant shortened word "AnandTech".
My mistake, won't repeat:-).