Hardware Reference
able to produce 371 megaflops/W, making it nearly twice as power efficient as its
predecessor, the BlueGene/L. This first BlueGene/P deployment was upgraded in
2009 to include 294,912 processors, giving it a computational punch of 1
petaflop/sec.
The heart of the BlueGene/P system is the custom node chip illustrated in
Fig. 8-38. It consists of four PowerPC 450 cores running at 850 MHz. The
PowerPC 450 is a pipelined dual-issue superscalar processor popular in embedded
systems. Each core has a pair of dual-issue floating-point units, which together
can issue four floating-point instructions per clock cycle. The floating-point
units have been augmented with a number of SIMD-type instructions sometimes
useful in scientific computations on arrays. While no performance slouch, this
chip is clearly not a top-of-the-line multiprocessor.
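
The figures quoted so far are enough for a rough check of the 1 petaflop/sec
claim. The sketch below is only a back-of-the-envelope upper bound, not a
measured result. It assumes that the 294,912 "processors" are individual cores,
that each floating-point instruction counts as one floating-point operation, and
that every core issues its maximum of four per cycle. The constant and function
names are illustrative only.

```python
# Figures taken from the text; how they are combined is an assumption.
CLOCK_HZ = 850e6          # PowerPC 450 clock rate (850 MHz)
FP_ISSUE_PER_CYCLE = 4    # two dual-issue floating-point units per core
CORES_PER_CHIP = 4        # cores on one custom node chip
TOTAL_CORES = 294_912     # assumption: "processors" in the text means cores


def peak_flops(cores, clock_hz=CLOCK_HZ, per_cycle=FP_ISSUE_PER_CYCLE):
    """Upper bound: every core issues the maximum FP instructions each cycle."""
    return cores * clock_hz * per_cycle


per_core = peak_flops(1)
per_chip = peak_flops(CORES_PER_CHIP)
system = peak_flops(TOTAL_CORES)

print(f"per core: {per_core / 1e9:.1f} gigaflops/sec")
print(f"per chip: {per_chip / 1e9:.1f} gigaflops/sec")
print(f"system:   {system / 1e15:.3f} petaflops/sec")
```

Under these assumptions the bound comes to 3.4 gigaflops/sec per core, 13.6
gigaflops/sec per chip, and just over 1 petaflop/sec for the upgraded system,
which is consistent with the figure quoted above.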
[Figure: custom chip with four PowerPC 450 cores, each with two FPUs, split L1
caches (I and D), and an L2 cache with snooping; two 4-MB L3 caches, each
connected to DDR2 DRAM; an interface to the 3D torus (north, south, east, west,
up, down); collective and barrier network connections; and 10Gb Ethernet.]
Figure 8-38. The BlueGene/P custom processor chip.
Three levels of cache are present on the chip. The first consists of a split L1
cache with 32 KB for instructions and 32 KB for data. The second is a unified
2-KB L2 cache. The L2 caches are really prefetch buffers rather than true
caches. They snoop on each other and are cache consistent. The third level is a
unified 4-MB shared cache that feeds data to the L2 caches. The four processors
share access to two 4-MB L3 cache modules. There is cache
 