Intel's Larrabee Architecture Disclosure: A Calculated First Move
by Anand Lal Shimpi & Derek Wilson on August 4, 2008 12:00 AM EST - Posted in GPUs
The Design Experiment: Could Intel Build a GPU?
Larrabee is fundamentally built out of existing Intel x86 core technology, which not only means that the chip design isn't foreign to Intel, but also has serious implications for the future of desktop microprocessors. Larrabee isn't, however, built on Intel's current bread and butter, the Core architecture. Instead, Intel turned to a much older architecture as the basis for Larrabee: the original Pentium.
The original Pentium was manufactured on a 0.80µm process, later shrinking to 0.60µm. The question Intel posed was this: could an updated version of the Pentium core, built on a modern day process and equipped with a very wide vector unit, make a solid foundation for a high-end GPU?
To test the theory, Intel started with a standard Core 2 Duo with a 4MB L2 cache, running at an undisclosed clock speed (somewhere in the 1.8 - 2.9GHz range, I'd guess). Then, on the same manufacturing process and within roughly the same die area and power budget, Intel sought to find out how many of these modified Pentium cores it could fit. The number was 10.
So in the space of a dual-core Core 2 Duo, Intel could construct this hypothetical 10-core chip. Let's look at the stats:
 | Intel Core 2 Duo | Hypothetical Larrabee |
# of CPU Cores | 2 out-of-order | 10 in-order |
Instructions per Issue | 4 per clock | 2 per clock |
VPU Lanes per Core | 4-wide SSE | 16-wide |
L2 Cache Size | 4MB | 4MB |
Single-Stream Throughput | 4 per clock | 2 per clock |
Vector Throughput | 8 per clock | 160 per clock |
Note that what we're comparing here is peak operation throughput: not how fast either chip can actually execute real code, just how many operations each can retire per clock.
Running a single instruction stream (e.g. a single-threaded application), the Core 2 can process as many as four operations per clock, since it can issue four instructions per clock and it isn't execution unit constrained. The 10-core design, however, can only issue two instructions per clock per core, so the peak execution rate for a single instruction stream is two operations per clock, half the throughput of the Core 2. That's fine, since you'll actually want to be running vector operations on this design and leave your single-threaded tasks to your Core 2 CPU anyway, and here's where the proposed architecture spreads its wings.
With two cores, each able to execute 4 concurrent SSE operations per clock, you've got a throughput of 8 ops per clock on Core 2. On the 10-core design, each core has a 16-wide vector unit, for 160 ops per clock: an increase of 20x in roughly the same die area and power budget.
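The arithmetic behind the table can be spelled out in a few lines of Python. This is only a sketch of the peak-rate math described above: the `Chip` class and its field names are made up for illustration, and the figures ignore clock speed, memory bandwidth and anything else that would keep a real chip from hitting its peak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Illustrative model of a chip's peak per-clock throughput (hypothetical names)."""
    name: str
    cores: int
    issue_width: int    # instructions issued per clock, per core
    vector_lanes: int   # vector operations per clock, per core

    def single_stream_ops_per_clock(self) -> int:
        # A single instruction stream runs on one core, so it is capped by that core's issue width.
        return self.issue_width

    def vector_ops_per_clock(self) -> int:
        # Peak vector throughput assumes every lane on every core is busy each clock.
        return self.cores * self.vector_lanes


core2 = Chip("Core 2 Duo", cores=2, issue_width=4, vector_lanes=4)
larrabee = Chip("Hypothetical Larrabee", cores=10, issue_width=2, vector_lanes=16)

for chip in (core2, larrabee):
    print(f"{chip.name}: {chip.single_stream_ops_per_clock()} single-stream ops/clock, "
          f"{chip.vector_ops_per_clock()} vector ops/clock")

ratio = larrabee.vector_ops_per_clock() / core2.vector_ops_per_clock()
print(f"Vector throughput ratio: {ratio:.0f}x")
```

The single-stream figure deliberately ignores core count, since one thread can only occupy one core; the vector figure scales with cores because vector workloads are assumed to spread across all of them.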
On paper this could actually work. If you had enough of these cores, you could get the vector throughput necessary to actually build a reasonable GPU. Of course there are issues like adapting the x86 instruction set for use in a GPU, getting all of the cores to communicate with one another and actually keeping all of these execution resources busy - but this design experiment showed that it was possible.
Thus Larrabee was born.
101 Comments
Shinei - Monday, August 4, 2008
Some competition might do nVidia good--if Larrabee manages to outperform nvidia, you know nvidia will go berserk and release another hammer like the NV40 after R3x0 spanked them for a year. Maybe we'll start seeing those price/performance gains we've been spoiled with until ATI/AMD decided to stop being competitive.
Overall, this can only mean good things, even if Larrabee itself ultimately fails.
Griswold - Monday, August 4, 2008
Wake-up call, dumbo. AMD just started to mop the floor with nvidia's products as far as price/performance goes.
watersb - Monday, August 4, 2008
great article! You compare the Larrabee to a Core 2 Duo - for SIMD instructions, you multiplied by a (hypothetical) 10 cores to show Larrabee at 160 SIMD instructions per clock (IPC). But you show non-vector IPC as 2.
For a 10-core Larrabee, shouldn't that be x10 as well? For 20 scalar IPC?
Adamv1 - Monday, August 4, 2008
I know Intel has been working on ray tracing and I'm really curious how this is going to fit into the picture. From what I remember, ray tracing is highly parallel and scales quite well with more cores, and they were talking about introducing it on 8-core processors. It seems to me this would be a great platform to try it on.
SuperGee - Thursday, August 7, 2008
How it fits: GPUs from ATI and nV are called hardware renderers. They still have a lot of fixed function: ROPs, TMUs, blenders, rasterizers, etc. Unified shaders are evolving to become more general purpose, but they aren't fully GP.
Larrabee is an exotic, massively multi-core x86 chip. It will act just like a multi-core CPU, but optimised for GPU tasks and deployed as a GPU.
So Intel uses a software renderer and will first emulate DirectX/OpenGL on it with its drivers.
nV and ATI are more HAL, with HEL as backup,
whereas Larrabee is pure HEL. But its parallel power will boost the software method, as it is just like a large bunch of x86 cores.
HEL will run fast on Larrabee, as if it were 'HAL', because the software computing power for such tasks is available with it.
What this means is that as a GFX engine developer you get full freedom if you are going to use Larrabee directly.
Like they say, first with a DirectX/OpenGL driver, later also with a CPU driver where it can easily be targeted directly, like a GPGPU task. But Larrabee could also pop up as extra cores in Windows.
This means that whatever you do is like a software solution.
You can make a software renderer based on ray tracing, but a voxel engine could be done too. This software renderer will be accelerated by the Larrabee massively multi-core CPU, which can also do GPU stuff very well, and it will boost any software renderer. Of course it must be fully optimised for Larrabee to get the most out of it, using those vector units and the x86 Larrabee extensions.
Novalogic could have used this too, for their voxel game engine back in the days of the PIII.
It could accelerate any software renderer which depends heavily on parallel computing.
icrf - Monday, August 4, 2008
Since I don't play many games anymore, that aspect of Larrabee doesn't interest me any more than making economies of scale so I can buy one cheap. I'm very interested in seeing how well something like POV-Ray or an H.264 encoder can be implemented, and what kind of speed increase it'd see. Sure, these things could be implemented on current GPUs through Cuda/CTM, but that's such a different kind of task, it's not at all quick or easy. If it's significantly simpler, we'd actually see software sooner that supports it.
cyberserf - Monday, August 4, 2008
one word: MATROX
Guuts - Monday, August 4, 2008
You're going to have to use more than one word, sorry... I have no idea what in this article has anything to do with Matrox.
phaxmohdem - Monday, August 4, 2008
What, you mean you DON'T have a Parhelia card in your PC? WTF is wrong with you?
TonyB - Monday, August 4, 2008
but can it play crysis?!