4.1 HPC at ALCF
In 2004, in response to Presidential and Congressional guidance that leadership in computing and computational science was a national priority, DOE released a call for proposals for "Leadership Computing Facilities" which would field systems capable of solving the largest, most complex, and most detailed problems. Access to these facilities would be allocated via the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program [3]. A joint proposal between ANL, Oak Ridge National Laboratory (ORNL), and Pacific Northwest National Laboratory (PNNL) was selected by DOE. The proposal called for ANL and ORNL to field systems of differing architectures, while PNNL would provide portions of the software stack. The differing architectures were proposed not only because some problems are more suited to one architecture than another, but also for risk mitigation should one of the new systems have significant start-up problems.
In 2004 ANL formed the Blue Gene Consortium in cooperation with IBM. In 2005, a 5 TFLOPS, one-rack Blue Gene/L system was fielded for evaluation, and in 2006 it began supporting six INCITE projects. In 2007, eight racks of Blue Gene/P were brought online; in 2008 the machine was expanded to 40 racks (the current 557 TFLOPS Intrepid system) and the allocation grew to 20 INCITE projects. In 2009 a 10 PFLOPS machine was approved, which was installed in 2012 as the current Mira system [1].
4.1.1 Intrepid
Intrepid consists of 40 racks, each containing 1024 "system on a chip" 850 MHz quad-core nodes, which are based on the PowerPC 450 core with dual floating-point units per core and 2 GB of RAM per node. This yields a total of 40,960 nodes, 163,840 cores, and 80 TB of RAM. The Blue Gene/P has five different networks: a 3D torus for internode point-to-point communication with 5.1 GB/s of bandwidth per node, a 0.5 μs per-hop latency, and a 5 μs farthest-hop latency; a tree or collective network for MPI collective operations and I/O with 6.8 GB/s per link per direction and 1.3 μs latency per tree traversal; a global barrier and interrupt network with 0.65 μs hardware latency; 10 Gbps Ethernet from the I/O nodes for I/O to storage; and a 1 Gbps Reliability, Availability, and Serviceability (RAS) network; see Morozov [6]. For comparison, typical store-and-forward Ethernet switches have single-hop hardware latencies in the range of tens of microseconds, and InfiniBand is around 1.7 μs, as noted by the HPC Advisory Council [5].
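To make the roles of these networks concrete, the following minimal MPI sketch (not taken from the text) exercises the two communication patterns discussed above: a point-to-point neighbor exchange, which on Blue Gene/P is routed over the 3D torus, and a global reduction, which is eligible for the tree/collective network. The mapping of operations to networks is handled by the system's MPI implementation, not by the application code.

/* Illustrative sketch: point-to-point (torus) vs. collective (tree)
 * traffic on Blue Gene/P. Portable MPI; network selection is done by
 * the MPI layer, not by this code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: exchange a value with neighboring ranks
       (carried by the 3D torus on Blue Gene/P). */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    double sendval = (double)rank, recvval = 0.0;
    MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, 0,
                 &recvval, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Collective: global sum, a candidate for the tree/collective network. */
    double sum = 0.0;
    MPI_Allreduce(&sendval, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks = %d, global sum of ranks = %.0f\n", size, sum);

    MPI_Finalize();
    return 0;
}

Compiled with a standard MPI wrapper such as mpicc, the same source runs unchanged on any MPI installation; on Blue Gene/P it simply benefits from the hardware collective and torus support described above.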
 