SUN’s UltraSparc T1 - the Next Generation Server CPUs


by Johan De Gelas on December 29, 2005 10:03 AM EST

The SUN benchmarks ...

Although we haven't run our own benchmarks yet, the results that SUN presents^[2] are still interesting. We'll delve deeper once we have our own numbers. The power consumption figures are estimates for the CPUs only. We tried to give you both the typical and the maximum values. Some manufacturers publish only typical numbers (Intel, IBM) while others publish only maximum numbers (AMD), so we had to find other sources and base our estimates on them.

SPECjbb2005 represents an order processing application for a wholesale supplier, written in Java.

SPECjbb2005

System	CPU	Power Dissipation CPUs (Estimated)	Number of cores	Number of Active threads	Score	Percentage score
Sun Fire T2000	1x 1.2GHz UltraSPARC T1	72-79 W	8	32	63,378	160%
Sun Fire X4200	2x 2.4GHz DC Opteron	150-180 W	4	4	45,124	114%
IBM p5 550	2x 1.9GHz POWER5+	320-360 W	4	8	61,789	156%
IBM xSeries 346	2x 2.8GHz DC Xeon	270-300 W	4	8	39,585	100%

The performance of the T1 is simply amazing. Of course, this is an ideal benchmark for the T1, as it runs many Java threads. The POWER5+ system is the only one that comes close, as it can process 8 threads simultaneously just like the T1. But its CPUs consume roughly 4 times as much power as the T1.
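The percentage column and the power comparison can be rechecked from the table itself. The following is a minimal sketch, not part of SUN's or our own test setup: the scores and estimated CPU power ranges are copied from the SPECjbb2005 table, while the dictionary layout and the function names `percentage_scores` and `power_ratio` are assumptions made purely for illustration.

```python
# Scores are copied from the SPECjbb2005 table; the watt figures are the
# article's own (typical, maximum) estimates for the CPUs only.
SPECJBB2005 = {
    "Sun Fire T2000": {"score": 63378, "cpu_watts": (72, 79)},
    "Sun Fire X4200": {"score": 45124, "cpu_watts": (150, 180)},
    "IBM p5 550": {"score": 61789, "cpu_watts": (320, 360)},
    "IBM xSeries 346": {"score": 39585, "cpu_watts": (270, 300)},
}
BASELINE = "IBM xSeries 346"  # the row listed at 100%


def percentage_scores(table, baseline):
    """Return each system's score as a rounded percentage of the baseline."""
    base = table[baseline]["score"]
    return {name: round(100 * row["score"] / base) for name, row in table.items()}


def power_ratio(table, system, reference):
    """Return (typical, maximum) CPU power of `system` relative to `reference`."""
    sys_typ, sys_max = table[system]["cpu_watts"]
    ref_typ, ref_max = table[reference]["cpu_watts"]
    return sys_typ / ref_typ, sys_max / ref_max


percentages = percentage_scores(SPECJBB2005, BASELINE)
p5_vs_t1 = power_ratio(SPECJBB2005, "IBM p5 550", "Sun Fire T2000")

for name, pct in percentages.items():
    print(f"{name:16s} {pct:4d}%")
print(f"POWER5+ vs T1 CPU power: {p5_vs_t1[0]:.1f}x to {p5_vs_t1[1]:.1f}x")
```

With the table's figures, the normalized scores match the "Percentage score" column, and the estimated CPU power of the two POWER5+ chips comes out somewhat above four times that of the single T1.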

SPECweb2005 emulates users sending browser requests over broadband Internet connections to a web server. It provides three new workloads: a banking site (HTTPS), an e-commerce site (HTTP/HTTPS mix), and a support site (HTTP). Dynamic content is implemented in PHP and JSP.

SPECweb2005

System	Processors	Power Dissipation CPUs (Estimated)	Number of cores	Number of Active threads	Score	Percentage score
Sun Fire T2000	1x 1.2GHz UltraSPARC T1	72-79 W	8	32	14,001	289%
IBM p5 550	2x 1.9GHz POWER5+	320-360 W	4	8	7,881	162%
IBM xSeries 346	2x 3.8GHz Xeon	220-260 W	4	4	4,348	90%
Dell 2850	2x 2.8GHz DC Xeon	260-300 W	4	8	4,850	100%

Here, the T1 is by far the best CPU. This benchmark is, however, very hard to interpret. For example, back in 2003, I did some benchmarking on a JSP server. Our first results were very weird: a single Xeon performed just as well as a dual Xeon, even though the Gigabit PCI NIC was nowhere near its limit (about 180 Mbit/s). Once we used an Intel NIC, things improved, but the network bottleneck did not disappear until we used an Intel CSA NIC (connected directly to the Northbridge). The benchmark depends more on the quality of the NIC driver, the latency from the NIC to memory (DMA) and, of course, the quality of the NIC chip itself than on the CPU. That being said, it is clear that web servers spawn a lot of threads that do not require much processing unless the traffic is encrypted. So, this is the natural habitat of the T1 CPU. As long as you can make sure that the CPU is the bottleneck, the CPU that can execute the most threads per cycle will win.

SAP 2-tier is based on the number one ERP software. The database back-end and the application run on the same machine.

System	Processors	Power Dissipation CPUs (Estimated)	Number of cores	Number of Active threads	Score	Percentage score
Sun Fire T2000	1x 1.2GHz UltraSPARC T1	72-79 W	8	32	4780	97%
IBM p5 550	2x 1.9GHz POWER5+	320-360 W	4	8	5020	102%
HP DL580	4x 3.33GHz Xeon MP	440-520 W	4	8	4700	96%
HP DL385	2x 2.2GHz DC Opteron	140-180 W	4	4	4920	100%

SAP 2-tier is a typical example of a benchmark with very low IPC. However, some of the queries are more complex, so the T1 cannot outperform the fatter cores. Still, the performance per watt is unbeatable.

Unbeatable?

The terms "paradigm shift" and "disruptive technology" have been abused so many times that we don't like to use them. But in the case of the T1 CPU, it would not be an exaggeration to say that it is the herald of a new generation of server CPUs, and that it has disrupted the server market. Single-core, single-threaded CPUs no longer stand a chance in this market. Does this also signal the end of superscalar CPUs in the server environment? Is a massive multi-core design with scalar cores the future for the entire server world? The SUN UltraSparc T1 simply wipes the floor with the competition when it comes to performance per Watt. According to this metric, the UltraSparc T1 is 4 to 12 times better.
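To see how a performance-per-Watt comparison of this kind is put together, here is a small sketch that divides each score by the estimated CPU power and compares the T1 to a selection of rows from the three tables. It is only an illustration: using the midpoint of each estimated power range is our assumption (the typical or maximum value would shift the ratios), and the names `midpoint` and `perf_per_watt_advantage` are hypothetical helpers, not anything SUN or SPEC defines.

```python
# Estimated CPU power of the single UltraSPARC T1 (typical, maximum), in watts.
T1_WATTS = (72, 79)

# (benchmark, competitor, T1 score, competitor score, competitor CPU watts)
# All numbers are copied from the three tables in this article.
ROWS = [
    ("SPECjbb2005", "2x DC Opteron", 63378, 45124, (150, 180)),
    ("SPECjbb2005", "2x POWER5+", 63378, 61789, (320, 360)),
    ("SPECjbb2005", "2x DC Xeon", 63378, 39585, (270, 300)),
    ("SPECweb2005", "2x POWER5+", 14001, 7881, (320, 360)),
    ("SPECweb2005", "2x 3.8GHz Xeon", 14001, 4348, (220, 260)),
    ("SAP 2-tier", "2x POWER5+", 4780, 5020, (320, 360)),
    ("SAP 2-tier", "4x Xeon MP", 4780, 4700, (440, 520)),
    ("SAP 2-tier", "2x DC Opteron", 4780, 4920, (140, 180)),
]


def midpoint(watts):
    """Midpoint of a (typical, maximum) power estimate. An assumption of this sketch."""
    return sum(watts) / 2


def perf_per_watt_advantage(t1_score, other_score, other_watts, t1_watts=T1_WATTS):
    """How many times better the T1's score per watt is than the competitor's."""
    t1 = t1_score / midpoint(t1_watts)
    other = other_score / midpoint(other_watts)
    return t1 / other


advantages = {
    (bench, cpu): perf_per_watt_advantage(t1_score, score, watts)
    for bench, cpu, t1_score, score, watts in ROWS
}

for (bench, cpu), advantage in advantages.items():
    print(f"{bench:12s} vs {cpu:15s}: {advantage:4.1f}x")
```

The exact ratio for any one pairing depends heavily on which power estimate is plugged in, which is worth keeping in mind given that all the power figures here are estimates for the CPUs alone.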

Fig 7: The cores of the T1 processor are hardly warmer than the rest of the die. A "fat" core has many more hotspots.

However, we think that there are also opportunities for the fatter cores. The main weaknesses of the T1 are its shallow pipeline and its low clock speed. The need to be compatible with previous Sparcs, and thus the need for the relatively big Register Window system (with 1 cycle access), also limits clock speed. While the competition has bigger cores, it does not need as many cores as the T1. Each superscalar core could make better use of its resources by using Coarse Grained Multithreading (Montecito), FMT or SMT (Power 5). That should allow these kinds of cores to achieve a higher IPC per core. Clock speed can be 2 to 3 times higher, allowing two dual-core "fat" CPUs or one quad-core "fat" CPU to outperform the T1.
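The argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is a deliberately crude model, assuming that throughput is simply cores x clock x IPC and that the T1's scalar cores peak at one instruction per cycle each. It ignores memory stalls, which is exactly where the T1's multithreading helps, so it only shows the break-even IPC that four fat cores would need to sustain. The function names are hypothetical.

```python
def peak_throughput(cores, clock_ghz, ipc):
    """Instructions per nanosecond for a chip, in this simplified model."""
    return cores * clock_ghz * ipc


def break_even_ipc(target_throughput, cores, clock_ghz):
    """IPC per core needed to reach `target_throughput` with the given cores and clock."""
    return target_throughput / (cores * clock_ghz)


# T1 figures from the article: 8 scalar cores at 1.2GHz.
# Peak IPC of 1.0 per core is an upper bound, not a measured value.
T1_CORES, T1_CLOCK_GHZ, T1_PEAK_IPC = 8, 1.2, 1.0
t1_peak = peak_throughput(T1_CORES, T1_CLOCK_GHZ, T1_PEAK_IPC)

# Four "fat" cores (two dual-core CPUs or one quad-core) at 2x and 3x the T1's clock.
FAT_CORES = 4
needed_ipc = {
    factor: break_even_ipc(t1_peak, FAT_CORES, T1_CLOCK_GHZ * factor)
    for factor in (2, 3)
}

for factor, ipc in needed_ipc.items():
    print(f"{FAT_CORES} fat cores at {factor}x clock need IPC >= {ipc:.2f} per core")
```

Under these assumptions, four fat cores at twice the T1's clock need to sustain an IPC of about 1.0 per core to match the T1's peak, and about 0.67 at three times the clock. Whether they can do that on a server workload with frequent cache misses is the open question.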

These kinds of CPUs consume quite a bit more power, but as long as this extra power usage is not dramatically higher, fat cores might still have a good chance in the market. After all, it is total system power that counts, and large RAID arrays and AC units often represent larger power draws than just the CPU. With the exception of the web server market, power consumption is not the number one priority most of the time, although it is important.

A study sponsored by SUN^[3] shows that the best results in commercial server loads are achieved with 4 to 6 threads per core, combined with 2- to 3-way superscalar in-order cores. This is another indication that there is a lot of room for very different multi-core approaches such as Intel's Montecito, IBM Power 6+ and upcoming multi-core Xeons and Opterons. A multi-threaded 64-bit version of Sossaman (31 Watt TDP per two cores) could also threaten the UltraSparc T1.

In some server-related markets, fat multi-cores might even be preferable. One such market is OLAP databases, where very complex queries are sent by a limited number of users. The response time of the T1 could be rather mediocre there, while a higher clocked CPU with fewer cores could be quite a bit more responsive under these loads. Also, OLAP queries that calculate statistical data will use more FP instructions.

49 Comments

thesix - Friday, December 30, 2005 - link
If you're talking about POWER5's SMT, currently it provides two HW threads per core:
http://publib.boulder.ibm.com/infocenter/pseries/i...

If you look closer at T1, the best one has 8 cores, each core supports four HW threads.
http://www.sun.com/processors/UltraSPARC-T1/

SMT and CMT appear to be the same type of technology (at least conceptually), with different names from two vendors.

> The very very poor FP performance of T1 is the truth.
> We have to remind ourselves that it is only a integer CPU. It's FP performance is too terrible.

OK. Since you have repeated it so many times, I am sure everyone who's reading this will remember, and I do not disagree :-).

Thanks.

Betwon - Friday, December 30, 2005 - link
We think that CMT and SMT are different.

For example:
The P4 630 is an SMT CPU, but not a CMT CPU.
The Athlon X2 is a CMT CPU, but not an SMT CPU.

From anandtech:
The T1 has no branch prediction, and it has only single-instruction issue per core and 8KB of L1D per core (too little for 4 threads to share).

POWER5 has 32KB of L1D per core, which is shared by two threads.

We think that the SMT of the T1 may only be OK if the 4 threads use very little L1D cache (which is impossible in most cases).

Betwon - Friday, December 30, 2005 - link
edit:
The only explanation for how to improve the (very poor) efficiency is to use SMT to hide the latency of stalls (from branch misses, cache misses, etc.).

But a core has only 8KB of L1 (which will be used by 4 threads), so cache misses will increase. It may even become worse.

Betwon - Friday, December 30, 2005 - link
edit: The T1 has no branch prediction and it has only single-instruction issue per core.

Brian23 - Friday, December 30, 2005 - link
Obviously the apps that they used to benchmark in this article like running on the chip. Also, this chip doesn't run Windows. It runs Sun's proprietary operating system. (I forgot what it's called.) Sun will give this new chip software support because they want it to do well.

I think I read in the article that the chip is backwards compatible with Sun's previous chip designs, meaning a lot of software is already available that will run on the chip.

Betwon - Friday, December 30, 2005 - link
NO!

The range of applications that run well in parallel on 32 threads is too narrow.

'Having many threads' is not the same as 'running 32 threads well in parallel'!

Even if there are 32 threads, if they do not run well in parallel, this new CPU will waste more than 90% of its potential.

The efficiency of Itanium (Itanium is capable of 1.3-1.5 IPC) is much better than that of x86 CPUs (0.7-0.9 IPC). Itanium never used OOO logic or long pipelines.

Betwon - Friday, December 30, 2005 - link
The efficiency of Itanium 2 is still better than that of IBM's POWER5: an Itanium 2 core can retire 6 instructions/cycle, and a POWER5 core can retire 5 instructions/cycle.

But a core of this new CPU retires only one instruction/cycle.

Brian23 - Friday, December 30, 2005 - link
I think you missed the part where x86 chips spend 400 cycles waiting on memory accesses, while the Sun chip just keeps chugging with another thread while the load is happening.

Calin - Tuesday, January 3, 2006 - link
Those 400 cycles are related to the higher clock speed (if your processor were half as fast, it would wait only 200 cycles). I assume the 400 cycles are based on the Xeon processor (which has a high clock speed and a slower FSB).

Betwon - Friday, December 30, 2005 - link
NO!
That is not true for all x86 CPUs. An Athlon 64 does spend many cycles waiting on memory accesses, but a P4 with HT just keeps chugging with another thread while the load is happening.

Do you understand what I want to say?
