Hardware Reference
In-Depth Information
coherency between the L1 caches on the four CPUs. Thus when a shared piece of
memory resides in more than one cache, accesses to that storage by one processor
will be immediately visible to the other three processors. A memory reference that
misses on the L1 cache but hits on the L2 cache takes about 11 clock cycles. A
miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to
go to the main DRAM takes about 75 cycles.
The four CPUs are connected via a high-bandwidth bus to a 3D torus network,
which requires six connections: up, down, north, south, east, and west. In addition,
each processor has a port to the collective network, used for broadcasting data to
all processors. The barrier port is used to speed up synchronization operations, giv-
ing each processor fast access to a specialized synchronization network.
At the next level up, IBM designed a custom card that holds one of the chips
shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are
shown in Fig. 8-39(a)-(b) respectively.
2-GB
DDR2
DRAM
Chip:
Card
Board
Cabinet
System
4 processors
8-MB L3 cache
1 Chip
4 CPUs
2GB
32 Cards
32 Chips
128 CPUs
64 GB
32 Boards
1024 Cards
1024 Chips
4096 CPUs
2TB
72 Cabinets
73728 Cards
73728 Chips
294912 CPUs
144 TB
(a)
(b)
(c)
(d)
(e)
Figure 8-39. The BlueGene/P: (a) chip. (b) card. (c) board. (d) cabinet.
(e) system.
The cards are mounted on plug-in boards, with 32 cards per board for a total of
32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of
DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c).
At the next level, 32 of these boards are plugged into a cabinet, packing 4096
CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d).
Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is
depicted in Fig. 8-39(e). A PowerPC 450 can issue up to 6 instructions/cycle, thus
 
Search WWH ::




Custom Search