of a tree. Many of the operations performed on highly parallel systems such as
BlueGene/P require the participation of all the nodes. For example, consider find-
ing the minimum value of a set of 65,536 values, one held in each node. The col-
lective network joins all the nodes in a tree. Whenever two nodes send their re-
spective values to a higher-level node, it selects the smaller one and forwards it
upward. In this way, far less traffic reaches the root than if all 65,536 nodes sent
a message there.
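To make the idea concrete, the sketch below expresses the same global-minimum
operation with MPI. MPI is an assumption here; the text does not say which
message-passing library BlueGene/P applications use. MPI_Reduce combines the
per-node values pairwise up a tree, so the root receives a single result rather
than 65,536 individual messages.

    /* A minimal sketch of a tree-style global-minimum reduction,
     * assuming MPI rather than BlueGene/P's own runtime interface. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        srand(rank + 1);
        int local_value = rand();      /* the one value held in this node */
        int global_min;

        /* Combine all values with MPI_MIN; only rank 0 receives the answer. */
        MPI_Reduce(&local_value, &global_min, 1, MPI_INT,
                   MPI_MIN, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global minimum = %d\n", global_min);

        MPI_Finalize();
        return 0;
    }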
The third network is the barrier network, used to implement global barriers and
interrupts. Some algorithms work in phases with each node required to wait until
all the others have completed the phase before starting the next phase. The barrier
network allows the software to define these phases and provides a way to suspend
all compute CPUs that reach the end of a phase until all of them have reached the
end, at which time they are all released. Interrupts also use this network.
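The following sketch shows what phase-based execution looks like to the
application, again written with MPI as an assumption rather than BlueGene/P's
own interface. Each process does its share of a phase and then waits at a
global barrier; the barrier network makes that wait cheap in hardware.

    /* A minimal sketch of phase-based execution with a global barrier. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_phase(int phase, int rank)
    {
        /* placeholder for the real per-node work of this phase */
        printf("rank %d finished phase %d\n", rank, phase);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int phase = 0; phase < 3; phase++) {
            do_phase(phase, rank);
            /* No process starts phase + 1 until every process has
             * reached this point. */
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }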
The fourth and fifth networks both use 10-gigabit Ethernet. One of them con-
nects the I/O nodes to the file servers, which are external to BlueGene/P, and to the
Internet beyond. The other one is used for debugging the system.
Each CPU node runs a small, custom, lightweight kernel that supports a single
user and a single process. This process has at most four threads, one running on
each CPU in the node. This simple structure was designed for high performance
and high reliability.
For additional reliability, application software can call a library procedure to
make a checkpoint. Once all outstanding messages have been cleared from the
network, a global checkpoint can be made and stored so that in the event of a sys-
tem failure, the job can be restarted from the checkpoint, rather than from the be-
ginning. The I/O nodes run a traditional Linux operating system and support mul-
tiple processes.
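The sketch below illustrates the style of application-level checkpointing
described above. The function name and per-rank file layout are hypothetical,
invented for the example; they are not BlueGene/P's actual checkpoint library.
A barrier stands in for the real step of clearing outstanding messages from the
network before the snapshot is written.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical checkpoint routine: each rank writes its own state
     * to a file so the job can later restart from the checkpoint. */
    static void checkpoint(int rank, const double *state, int n)
    {
        char path[64];
        snprintf(path, sizeof(path), "ckpt_rank%d.dat", rank);

        /* Stand-in for clearing all outstanding messages from the
         * network before the global snapshot is taken. */
        MPI_Barrier(MPI_COMM_WORLD);

        FILE *f = fopen(path, "wb");
        if (f != NULL) {
            fwrite(state, sizeof(double), n, f);   /* save this node's state */
            fclose(f);
        }
    }

    int main(int argc, char **argv)
    {
        int rank;
        double state[4] = { 0.0, 1.0, 2.0, 3.0 };  /* toy application state */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        checkpoint(rank, state, 4);                /* restartable from here */

        MPI_Finalize();
        return 0;
    }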
Work is continuing to develop the next generation of BlueGene system, called
the BlueGene/Q. This system is expected to go online in 2012, and it will have 18
processors per compute chip, which also feature simultaneous multithreading.
These two features should greatly increase the number of instructions per cycle the
system can execute. The system is expected to reach speeds of 20 petaflops.
For more information about BlueGene see Adiga et al. (2002), Alam et al. (2008),
Almasi et al. (2003a, 2003b), Blumrich et al. (2005), and IBM (2008).
Red Storm
As our second example of an MPP, let us consider the Red Storm machine
(also called Thor's Hammer) at Sandia National Laboratory. Sandia is operated by
Lockheed Martin and does classified and unclassified work for the U.S. Depart-
ment of Energy. Some of the classified work concerns the design and simulation
of nuclear weapons, which is highly compute intensive.
Sandia has been in this business for a long time and over the decades has had
many leading-edge supercomputers. For decades, it favored vector