a full BlueGene/P system could conceivably issue up to 1,769,472 instructions per
cycle. At 850 MHz, this gives the system a theoretical peak performance of 1.504
petaflops/sec. However, data hazards, memory latency, and lack of parallelism
work together to ensure that the actual throughput of the system is much less. Real
programs running on the BlueGene/P have demonstrated performance rates of up
to 1 petaflop/sec.
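As a quick sanity check, the peak figure can be reproduced directly; the sketch below assumes the per-node issue rate of 24 instructions per cycle implied by the quoted total and the 73,728-node count given later in the text:

# Back-of-the-envelope check of the peak figure quoted above:
# 73,728 nodes x 24 instructions/cycle/node = 1,769,472 instructions/cycle.
instructions_per_cycle = 73_728 * 24       # = 1,769,472
clock_hz = 850e6                           # 850 MHz
peak = instructions_per_cycle * clock_hz
print(f"peak: {peak / 1e15:.3f} petaflops/sec")   # prints: peak: 1.504 petaflops/sec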
The system is a multicomputer in the sense that no CPU has direct access to
any memory except the 2 GB on its own card. While CPUs within a processor
chip have shared memory, processors at the board, rack, and system level do not
share the same memory. In addition, there is no demand paging because there are
no local disks to page off. Instead, the system has 1152 I/O nodes, which are connected to disks and the other peripheral devices.
All in all, while the system is extremely large, it is also quite straightforward
with little new technology except in the area of high-density packaging. The decision to keep it simple was no accident since a major goal was high reliability and
availability. Consequently, a great deal of careful engineering went into the power
supplies, fans, cooling, and cabling with the goal of a mean-time-to-failure of at
least 10 days.
To connect all the chips, a scalable, high-performance interconnect is needed.
The design used is a three-dimensional torus measuring 72 × 32 × 32. As a consequence, each CPU needs only six connections to the torus network, two to other CPUs logically above and below it, north and south of it, and east and west of it. These six connections are labeled east, west, north, south, up, and down, respectively, in Fig. 8-38. Physically, each 1024-node cabinet is an 8 × 8 × 16 torus. Pairs of neighboring cabinets form an 8 × 8 × 32 torus. Four pairs of cabinets in the same row form an 8 × 32 × 32 torus. Finally, all 9 rows form a 72 × 32 × 32 torus.
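To make the six-link wiring concrete, here is a minimal sketch of how a node's neighbors in the 72 × 32 × 32 torus can be computed; which compass label goes with which axis is an assumption here, since the text does not pin that down:

DIMS = (72, 32, 32)   # the full-system torus described above

def torus_neighbors(x, y, z, dims=DIMS):
    # Coordinates wrap around in every dimension; that wraparound is
    # what makes the network a torus rather than a mesh.
    dx, dy, dz = dims
    return {
        "east":  ((x + 1) % dx, y, z),
        "west":  ((x - 1) % dx, y, z),
        "north": (x, (y + 1) % dy, z),
        "south": (x, (y - 1) % dy, z),
        "up":    (x, y, (z + 1) % dz),
        "down":  (x, y, (z - 1) % dz),
    }

# A node at the "edge" still has exactly six neighbors: east and up wrap.
print(torus_neighbors(71, 5, 31))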
All links are thus point-to-point and operate at 3.4 Gbps. Since each of the
73,728 nodes has three links to "higher" numbered nodes, one in each dimension,
the total bandwidth of the system is 752 terabits/sec. The information content of
this book is about 300 million bits, including all the art in encapsulated PostScript
format, so BlueGene/P could move 2.5 million copies of this book per second.
Where they would go and who would want them is left as an exercise for the
reader.
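The bandwidth claim is easy to verify; the following lines just recompute the numbers quoted above:

nodes = 73_728
links_per_node = 3           # one link to a "higher" neighbor per dimension
link_gbps = 3.4
total_tbps = nodes * links_per_node * link_gbps / 1000
print(f"aggregate bandwidth: {total_tbps:.0f} terabits/sec")        # 752
book_bits = 300e6            # information content of the book, as above
print(f"copies per second: {total_tbps * 1e12 / book_bits:,.0f}")   # 2,506,752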
Communication on the 3D torus is done in the form of virtual cut-through
routing. This technique is somewhat akin to store-and-forward packet switching,
except that entire packets are not stored before being forwarded. As soon as a byte
has arrived at one node, it can be forwarded to the next one along the path, even
before the entire packet has arrived. Both dynamic (adaptive) and deterministic
(fixed) routing are possible. A small amount of special-purpose hardware on the
chip is used to implement the virtual cut-through.
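A simple latency model shows why cut-through wins: with store-and-forward, every hop waits for the entire packet, whereas with cut-through only the first transmission pays the full serialization delay and each additional hop adds roughly one header time. The sketch below is an idealized model, not actual BlueGene/P behavior; the packet size, header size, and hop count are made-up illustrative values:

LINK_BPS = 3.4e9             # 3.4-Gbps links, as in the text

def store_and_forward(packet_bits, hops):
    # Each hop buffers the entire packet before forwarding it.
    return hops * packet_bits / LINK_BPS

def cut_through(packet_bits, header_bits, hops):
    # Forwarding starts as soon as the routing header has arrived, so
    # each intermediate hop adds only a header's worth of delay.
    return packet_bits / LINK_BPS + (hops - 1) * header_bits / LINK_BPS

pkt, hdr, hops = 8 * 1024 * 8, 8 * 8, 10    # 8-KB packet, 8-byte header, 10 hops
print(f"store-and-forward: {store_and_forward(pkt, hops) * 1e6:.1f} us")   # 192.8 us
print(f"cut-through:       {cut_through(pkt, hdr, hops) * 1e6:.1f} us")    #  19.4 us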
In addition to the main 3D torus used for data transport, four other communication networks are present. The second one is the collective network, in the form of a tree.
 