of a tree. Many of the operations performed on highly parallel systems such as
BlueGene/P require the participation of all the nodes. For example, consider find-
ing the minimum value of a set of 65,536 values, one held in each node. The col-
lective network joins all the nodes in a tree. Whenever two nodes send their re-
spective values to a higher-level node, it selects the smaller one and forwards it
upward. In this way, far less traffic reaches the root than if all 65,536 nodes sent
a message there.
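To make the idea concrete, the sketch below expresses the same global-minimum
operation with MPI. MPI is an assumption here; the text does not say which
message-passing library BlueGene/P applications use. MPI_Reduce combines the
per-node values pairwise up a tree, so the root receives a single result rather
than 65,536 individual messages.

    /* A minimal sketch of a tree-style global-minimum reduction,
     * assuming MPI rather than BlueGene/P's own runtime interface. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        srand(rank + 1);
        int local_value = rand();      /* the one value held in this node */
        int global_min;

        /* Combine all values with MPI_MIN; only rank 0 receives the answer. */
        MPI_Reduce(&local_value, &global_min, 1, MPI_INT,
                   MPI_MIN, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global minimum = %d\n", global_min);

        MPI_Finalize();
        return 0;
    }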
The third network is the barrier network, used to implement global barriers and
interrupts. Some algorithms work in phases with each node required to wait until
all the others have completed the phase before starting the next phase. The barrier
network allows the software to define these phases and provides a way to suspend
all compute CPUs that reach the end of a phase until all of them have reached the
end, at which time they are all released. Interrupts also use this network.
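The following sketch shows what phase-based execution looks like to the
application, again written with MPI as an assumption rather than BlueGene/P's
own interface. Each process does its share of a phase and then waits at a
global barrier; the barrier network makes that wait cheap in hardware.

    /* A minimal sketch of phase-based execution with a global barrier. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_phase(int phase, int rank)
    {
        /* placeholder for the real per-node work of this phase */
        printf("rank %d finished phase %d\n", rank, phase);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int phase = 0; phase < 3; phase++) {
            do_phase(phase, rank);
            /* No process starts phase + 1 until every process has
             * reached this point. */
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }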
The fourth and fifth networks both use 10-gigabit Ethernet. One of them con-
nects the I/O nodes to the file servers, which are external to BlueGene/P, and to the
Internet beyond. The other one is used for debugging the system.
Each CPU node runs a small, custom, lightweight kernel that supports a single
user and a single process. This process has at most four threads, one running on
each CPU in the node. This simple structure was designed for high performance
and high reliability.
For additional reliability, application software can call a library procedure to
make a checkpoint. Once all outstanding messages have been cleared from the
network, a global checkpoint can be made and stored so that in the event of a sys-
tem failure, the job can be restarted from the checkpoint, rather than from the be-
ginning. The I/O nodes run a traditional Linux operating system and support mul-
tiple processes.
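The sketch below illustrates the style of application-level checkpointing
described above. The function name and per-rank file layout are hypothetical,
invented for the example; they are not BlueGene/P's actual checkpoint library.
A barrier stands in for the real step of clearing outstanding messages from the
network before the snapshot is written.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical checkpoint routine: each rank writes its own state
     * to a file so the job can later restart from the checkpoint. */
    static void checkpoint(int rank, const double *state, int n)
    {
        char path[64];
        snprintf(path, sizeof(path), "ckpt_rank%d.dat", rank);

        /* Stand-in for clearing all outstanding messages from the
         * network before the global snapshot is taken. */
        MPI_Barrier(MPI_COMM_WORLD);

        FILE *f = fopen(path, "wb");
        if (f != NULL) {
            fwrite(state, sizeof(double), n, f);   /* save this node's state */
            fclose(f);
        }
    }

    int main(int argc, char **argv)
    {
        int rank;
        double state[4] = { 0.0, 1.0, 2.0, 3.0 };  /* toy application state */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        checkpoint(rank, state, 4);                /* restartable from here */

        MPI_Finalize();
        return 0;
    }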
Work is continuing to develop the next generation of BlueGene system, called
the BlueGene/Q. This system is expected to go online in 2012, and it will have 18
processors per compute chip, which also feature simultaneous multithreading.
These two features should greatly increase the number of instructions per cycle the
system can execute. The system is expected to reach speeds of 20 petaflops.
For more information about BlueGene see Adiga et al. (2002), Alam et al. (2008),
Almasi et al. (2003a, 2003b), Blumrich et al. (2005), and IBM (2008).
Red Storm
As our second example of an MPP, let us consider the Red Storm machine
(also called Thor's Hammer) at Sandia National Laboratory. Sandia is operated by
Lockheed Martin and does classified and unclassified work for the U.S. Depart-
ment of Energy. Some of the classified work concerns the design and simulation
of nuclear weapons, which is highly compute intensive.
Sandia has been in this business for a long time and over the decades has had
many leading-edge supercomputers. For decades, it favored vector