Checkpoints, in part, guard against unexpected application failure, which
can be caused by non-permanent faults (i.e., soft errors). Cosmic-ray-induced
neutrons are a well-known source of soft errors in memory, having been ex-
tensively studied in at least one large supercomputer [23]. There are several
hardware-based strategies for mitigating soft errors in memory, but each has
its own trade-offs in terms of effectiveness, power consumption, and speed [25].
In exascale systems, soft errors are expected to become common, and
designers will increasingly need to mitigate them in software as well as hardware.
Soft errors do not discriminate: they can manifest in application software
or system software. Application researchers are actively seeking modifications
to their algorithms and data structures that allow applications to successfully
continue in the face of soft errors [8]. System software researchers are surveying
operating systems for important data structures that require hardening, such
that soft errors will not cause a node to crash [9].
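One common software-hardening technique is to pair a critical data structure with a checksum and verify it on every read, so a silent bit flip is detected rather than acted upon. The following is an illustrative sketch of that idea, not code from the cited work; the class name and the bit-flip test hook are invented for the example.

```python
# Illustrative sketch: detect a soft error in a critical data structure
# by storing a CRC32 checksum alongside its serialized bytes.
import pickle
import zlib

class HardenedRecord:
    """Holds a value together with a checksum so that a silent bit flip
    in the stored bytes is caught on read instead of propagating."""

    def __init__(self, value):
        self._data = bytearray(pickle.dumps(value))
        self._crc = zlib.crc32(self._data)

    def read(self):
        # Verify integrity before deserializing.
        if zlib.crc32(self._data) != self._crc:
            raise RuntimeError("soft error detected: checksum mismatch")
        return pickle.loads(bytes(self._data))

    def corrupt_bit(self, byte_index, bit):
        # Test hook only: simulate a cosmic-ray-induced bit flip.
        self._data[byte_index] ^= (1 << bit)

rec = HardenedRecord([1, 2, 3])
assert rec.read() == [1, 2, 3]          # clean read passes the check
rec.corrupt_bit(0, 3)                   # flip a single bit
try:
    rec.read()
except RuntimeError:
    pass                                # the flip is detected, not silent
```

Detection alone forces a restart from a checkpoint; schemes that also correct (e.g., keeping a redundant copy or an error-correcting code) trade extra memory for continued forward progress.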
Library-based fault tolerance is also an option. Scalable Check-
point/Restart (SCR) is an I/O library that provides a range of local checkpoint
strategies to applications [24]. Its main goal is to enable a node in a job to
checkpoint to one or more other nodes, allowing an application to recover
from failure of individual nodes in a job. It is designed to work with on-node
storage, which could be a RAM disk, spinning disk, flash, NVRAM, etc.
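The core idea, checkpointing to a partner node so that the loss of any single node's storage is survivable, can be sketched as follows. This is a conceptual illustration using local directories as stand-ins for per-node storage; it is not SCR's actual API, and the function names are invented.

```python
# Conceptual sketch of "partner" checkpointing: each rank writes its
# checkpoint to its own node's storage and a redundant copy to the next
# rank's storage, so any single node failure leaves a surviving copy.
import os
import tempfile

def partner_checkpoint(rank, nranks, state, storage_root):
    """Write rank's state locally and mirror it to (rank + 1) % nranks."""
    partner = (rank + 1) % nranks
    for target in (rank, partner):
        node_dir = os.path.join(storage_root, f"node{target}")
        os.makedirs(node_dir, exist_ok=True)
        with open(os.path.join(node_dir, f"ckpt.rank{rank}"), "w") as f:
            f.write(state)

def recover(rank, nranks, storage_root, failed_nodes=()):
    """Read rank's checkpoint from whichever copy survived."""
    for target in (rank, (rank + 1) % nranks):
        if target in failed_nodes:
            continue
        path = os.path.join(storage_root, f"node{target}", f"ckpt.rank{rank}")
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
    raise RuntimeError("both checkpoint copies lost")

# Four ranks checkpoint; node 2's storage then fails, but rank 2's
# state survives on its partner, node 3.
root = tempfile.mkdtemp()
for r in range(4):
    partner_checkpoint(r, 4, f"state-of-{r}", root)
assert recover(2, 4, root, failed_nodes={2}) == "state-of-2"
```

SCR generalizes this pattern with several redundancy schemes of varying cost, and only periodically flushes checkpoints to the parallel file system, which is what reduces the load on the I/O system.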
Another library-based approach is rMPI [16]. rMPI uses replicated
computations, a proven method of ensuring operability of mission-critical sys-
tems. By implementing MPI calls to ensure message ordering and state main-
tenance between replicas, it allows an application to transparently benefit
from redundant computations. Although a job using rMPI uses at least twice
as many cores as compute ranks, the overhead can be reasonable. For very
large jobs, the decreased checkpoint and restart activity can yield appreciable
speed-ups over a non-replicated job running on an equal number of nodes, while
simultaneously reducing potential load on an I/O system.
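The principle behind replication can be sketched in a few lines: run the same work on more than one replica and cross-check the results, so a soft error that corrupts one replica's answer is caught instead of propagating. This is a conceptual sketch only; rMPI itself performs this transparently at the MPI layer, and the `fault` parameter here is an invented test hook.

```python
# Conceptual sketch of redundant computation with cross-checking:
# run identical work on each replica and require agreement.
def replicated_call(work, replicas=2, fault=None):
    """Run `work` on each replica and return the agreed-upon result.
    `fault` optionally corrupts one replica's result to simulate a
    soft error; disagreement raises instead of returning bad data."""
    results = [work() for _ in range(replicas)]
    if fault is not None:
        idx, bad_value = fault
        results[idx] = bad_value        # simulated soft error
    if len(set(results)) != 1:
        raise RuntimeError("replica divergence: soft error suspected")
    return results[0]

assert replicated_call(lambda: 2 + 2) == 4   # replicas agree
try:
    replicated_call(lambda: 2 + 2, fault=(1, 5))
except RuntimeError:
    pass                                     # divergence detected
```

With two replicas the scheme detects a divergence but cannot tell which copy is correct; three or more replicas allow majority voting, trading still more cores for the ability to continue without a restart.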
34.4 Conclusion
It is widely known that the projected power envelope for an exascale com-
puter will be extremely limiting. All subsystems are being investigated for
potential inefficiencies, including storage. This chapter presents the widest
survey of storage power use in HPC systems, which shows that today's stor-
age systems are not generally significant users of power. However, extrapolat-
ing potential disk sizes and storage requirements to exascale systems shows
that today's methods of constructing and using storage will be completely
inadequate from a power perspective.
Fortunately, there are several technologies under development that serve to
reduce the demands on I/O systems for bulk synchronous checkpoints. These