4.1.2 Mira
The Blue Gene/Q system is an evolution of Blue Gene/P and is similar
in architecture. Mira, a 10 PFLOPS Blue Gene/Q system, consists of 48 racks,
each containing 1024 system-on-chip nodes with 18 cores running at 1.6 GHz.
The "A2" cores are based on the 64-bit Power ISA v2.06 specification [2],
with quad floating-point units per core. Each node has 16 GB of RAM, for a
total of 49,152 nodes, 786,432 compute cores, and 768 TB of RAM. The
processor has several unique features.
Sixteen of the 18 cores are available for computation, one is used for
operating system services, and the eighteenth is a spare that is shut down
during normal operations and held in reserve in case one of the other cores
fails in production. The processor also supports transactional memory and
speculative execution in hardware. In the Blue Gene/Q, IBM collapsed the
collective and barrier network functionality into the torus network. It has
a 5D torus network, which is 4 × 4 × 4 × 4 × 2 in a 512-node half rack
(midplane), with 4 GB/s bidirectional bandwidth in each torus dimension,
80 ns nearest-neighbor latency, and 1.5 µs maximum latency. There is an
eleventh 4 GB/s link for I/O communication, which is covered in more detail
in Section 4.2 and in Parker [7].
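The figures quoted above can be cross-checked with a few lines of arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical, the peak-rate estimate assumes that each quad floating-point unit retires a 4-wide fused multiply-add every cycle (8 flops per cycle per core, which the text does not state), and the neighbor calculation treats a single midplane as a self-contained torus.

```python
from math import prod

# Figures quoted in the text.
RACKS = 48
NODES_PER_RACK = 1024
COMPUTE_CORES_PER_NODE = 16        # 16 of the 18 cores run application code
RAM_PER_NODE_GB = 16
CLOCK_GHZ = 1.6
MIDPLANE_SHAPE = (4, 4, 4, 4, 2)   # 5D torus inside one 512-node midplane

# Assumption (not stated in the text): 4 SIMD lanes * 2 flops per fused
# multiply-add = 8 flops per cycle per core.
FLOPS_PER_CYCLE_PER_CORE = 8


def system_totals():
    """Derive the aggregate node, core, memory, and peak-rate figures."""
    nodes = RACKS * NODES_PER_RACK
    cores = nodes * COMPUTE_CORES_PER_NODE
    ram_tb = nodes * RAM_PER_NODE_GB / 1024
    # cores * GHz * flops/cycle gives GFLOPS; divide by 1e6 for PFLOPS.
    peak_pflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE / 1e6
    return {"nodes": nodes, "cores": cores,
            "ram_tb": ram_tb, "peak_pflops": peak_pflops}


def torus_neighbors(coord, shape=MIDPLANE_SHAPE):
    """Return the distinct nearest neighbors of `coord`, with wraparound."""
    neighbors = set()
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            n = list(coord)
            n[dim] = (n[dim] + step) % size   # wrap around the torus edge
            neighbors.add(tuple(n))
    return neighbors


if __name__ == "__main__":
    print(system_totals())
    print("nodes per midplane:", prod(MIDPLANE_SHAPE))
    print("links per node:", 2 * len(MIDPLANE_SHAPE))
    print(sorted(torus_neighbors((0, 0, 0, 0, 0))))
```

Two links per dimension give the ten torus links that make the I/O link the eleventh. In this standalone-midplane sketch the dimension of size 2 reaches the same neighbor in both directions, so a node has ten links but only nine distinct neighbors.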
Additionally, ALCF runs two 100-node visualization clusters, Eureka and
Tukey. Eureka supports users on Intrepid and uses older single-precision
NVIDIA GPUs. Tukey supports users on Mira and uses newer double-precision
NVIDIA GPUs.
4.2 Overview of I/O at ALCF
There are many classes of I/O in HPC. The most common class, and the one
most systems are designed around, is defensive I/O, usually called a
checkpoint. The primary purpose of defensive I/O is to write sufficient
information to disk so that the application can restart from that point in
the event the program fails before it completes. Sometimes restart is the
only use for this file, and once the next checkpoint is written, or the
program completes, the file can be deleted. In other cases the file also
provides useful output.
However, there are many other types of I/O that occur in the typical scientific
workflow. These include post-processing or analysis, file transfers in and out
of the facility, file transfers from the scratch file system to a more permanent
location, and transfer to tape.
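The defensive I/O pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular application or ALCF system implements checkpointing: the file naming scheme, the use of pickle, the function names, and the toy computation are all hypothetical. The checkpoint is written to a temporary file and renamed, so a crash during the write never leaves a partial file under the final name, and older checkpoints are deleted once the next one is safely on disk.

```python
import os
import pickle
import tempfile


def write_checkpoint(state, step, directory):
    """Write a checkpoint for `step`, then delete older checkpoints."""
    final_name = f"ckpt_{step:08d}.pkl"
    final_path = os.path.join(directory, final_name)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())          # push the data to the storage device
    os.replace(tmp_path, final_path)  # atomic rename to the final name
    for name in os.listdir(directory):
        if (name.startswith("ckpt_") and name.endswith(".pkl")
                and name != final_name):
            os.remove(os.path.join(directory, name))
    return final_path


def load_latest_checkpoint(directory):
    """Return the newest checkpoint contents, or None if there is none."""
    names = sorted(n for n in os.listdir(directory)
                   if n.startswith("ckpt_") and n.endswith(".pkl"))
    if not names:
        return None
    with open(os.path.join(directory, names[-1]), "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_every, directory, fail_at=None):
    """Toy computation that resumes from the latest checkpoint, if any."""
    ckpt = load_latest_checkpoint(directory)
    step, state = (ckpt["step"], ckpt["state"]) if ckpt else (0, 0)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        state += step                 # stand-in for real computation
        step += 1
        if step % checkpoint_every == 0:
            write_checkpoint(state, step, directory)
    return state


def demo():
    """Fail partway through a run, then restart and finish it."""
    with tempfile.TemporaryDirectory() as d:
        try:
            run(10, 3, d, fail_at=7)
        except RuntimeError:
            pass                      # the job "crashed" after step 6 was saved
        return run(10, 3, d)          # resumes from the step-6 checkpoint
```

In this sketch the restarted run repeats only the work done after the last checkpoint, which is the trade-off behind choosing a checkpoint interval: more frequent checkpoints cost more I/O time but lose less computation on failure.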
The common storage system design for HPC systems has a scratch file
system where checkpoint files are written. This file system is fast in order to
minimize the I/O time. However, because these systems are designed to be
fast, and are therefore expensive, the file systems are generally too small to hold all
the data, so facilities implement purge policies. A typical purge policy states
that if a file has not been accessed after a small number of weeks, say six or
 