Checkpoints, in part, guard against unexpected application failure, which
can be caused by non-permanent faults (i.e., soft errors). Cosmic-ray-induced
neutrons are a well-known source of soft errors in memory, having been ex-
tensively studied in at least one large supercomputer [23]. There are several
hardware-based strategies for mitigating soft errors in memory, but each has
its own trade-offs in terms of effectiveness, power consumption, and speed [25].
In exascale systems, soft errors are expected to become common, and
designers will increasingly need to mitigate them in software as well as hardware.
Soft errors do not discriminate: they can manifest in application software
or system software. Application researchers are actively seeking modifications
to their algorithms and data structures that allow applications to successfully
continue in the face of soft errors [8]. System software researchers are surveying
operating systems for important data structures that require hardening, such
that soft errors will not cause a node to crash [9].
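One common software-hardening technique is to pair a critical data structure with a checksum and verify it on every read, so a silent bit flip is detected rather than acted upon. The following is an illustrative sketch of that idea, not code from the cited work; the class name and the bit-flip test hook are invented for the example.

```python
# Illustrative sketch: detect a soft error in a critical data structure
# by storing a CRC32 checksum alongside its serialized bytes.
import pickle
import zlib

class HardenedRecord:
    """Holds a value together with a checksum so that a silent bit flip
    in the stored bytes is caught on read instead of propagating."""

    def __init__(self, value):
        self._data = bytearray(pickle.dumps(value))
        self._crc = zlib.crc32(self._data)

    def read(self):
        # Verify integrity before deserializing.
        if zlib.crc32(self._data) != self._crc:
            raise RuntimeError("soft error detected: checksum mismatch")
        return pickle.loads(bytes(self._data))

    def corrupt_bit(self, byte_index, bit):
        # Test hook only: simulate a cosmic-ray-induced bit flip.
        self._data[byte_index] ^= (1 << bit)

rec = HardenedRecord([1, 2, 3])
assert rec.read() == [1, 2, 3]          # clean read passes the check
rec.corrupt_bit(0, 3)                   # flip a single bit
try:
    rec.read()
except RuntimeError:
    pass                                # the flip is detected, not silent
```

Detection alone forces a restart from a checkpoint; schemes that also correct (e.g., keeping a redundant copy or an error-correcting code) trade extra memory for continued forward progress.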
Library-based fault tolerance is also an option. Scalable Check-
point/Restart (SCR) is an I/O library that provides a range of local checkpoint
strategies to applications [24]. Its main goal is to enable a node in a job to
checkpoint to one or more other nodes, allowing an application to recover
from failure of individual nodes in a job. It is designed to work with on-node
storage, which could be a RAM disk, spinning disk, flash, NVRAM, etc.
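The core idea, checkpointing to a partner node so that the loss of any single node's storage is survivable, can be sketched as follows. This is a conceptual illustration using local directories as stand-ins for per-node storage; it is not SCR's actual API, and the function names are invented.

```python
# Conceptual sketch of "partner" checkpointing: each rank writes its
# checkpoint to its own node's storage and a redundant copy to the next
# rank's storage, so any single node failure leaves a surviving copy.
import os
import tempfile

def partner_checkpoint(rank, nranks, state, storage_root):
    """Write rank's state locally and mirror it to (rank + 1) % nranks."""
    partner = (rank + 1) % nranks
    for target in (rank, partner):
        node_dir = os.path.join(storage_root, f"node{target}")
        os.makedirs(node_dir, exist_ok=True)
        with open(os.path.join(node_dir, f"ckpt.rank{rank}"), "w") as f:
            f.write(state)

def recover(rank, nranks, storage_root, failed_nodes=()):
    """Read rank's checkpoint from whichever copy survived."""
    for target in (rank, (rank + 1) % nranks):
        if target in failed_nodes:
            continue
        path = os.path.join(storage_root, f"node{target}", f"ckpt.rank{rank}")
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
    raise RuntimeError("both checkpoint copies lost")

# Four ranks checkpoint; node 2's storage then fails, but rank 2's
# state survives on its partner, node 3.
root = tempfile.mkdtemp()
for r in range(4):
    partner_checkpoint(r, 4, f"state-of-{r}", root)
assert recover(2, 4, root, failed_nodes={2}) == "state-of-2"
```

SCR generalizes this pattern with several redundancy schemes of varying cost, and only periodically flushes checkpoints to the parallel file system, which is what reduces the load on the I/O system.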
Another library-based approach is rMPI [16]. rMPI uses replicated
computations, a proven method of ensuring operability of mission-critical sys-
tems. By implementing MPI calls to ensure message ordering and state main-
tenance between replicas, it allows an application to transparently benefit
from redundant computations. Although a job using rMPI uses at least twice
as many cores as compute ranks, the overhead can be reasonable. For very
large jobs, the decreased checkpoint and restart activity can yield appreciable
speed-ups over a non-replicated job running on an equal number of nodes, while
simultaneously reducing potential load on an I/O system.
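The principle behind replication can be sketched in a few lines: run the same work on more than one replica and cross-check the results, so a soft error that corrupts one replica's answer is caught instead of propagating. This is a conceptual sketch only; rMPI itself performs this transparently at the MPI layer, and the `fault` parameter here is an invented test hook.

```python
# Conceptual sketch of redundant computation with cross-checking:
# run identical work on each replica and require agreement.
def replicated_call(work, replicas=2, fault=None):
    """Run `work` on each replica and return the agreed-upon result.
    `fault` optionally corrupts one replica's result to simulate a
    soft error; disagreement raises instead of returning bad data."""
    results = [work() for _ in range(replicas)]
    if fault is not None:
        idx, bad_value = fault
        results[idx] = bad_value        # simulated soft error
    if len(set(results)) != 1:
        raise RuntimeError("replica divergence: soft error suspected")
    return results[0]

assert replicated_call(lambda: 2 + 2) == 4   # replicas agree
try:
    replicated_call(lambda: 2 + 2, fault=(1, 5))
except RuntimeError:
    pass                                     # divergence detected
```

With two replicas the scheme detects a divergence but cannot tell which copy is correct; three or more replicas allow majority voting, trading still more cores for the ability to continue without a restart.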
34.4 Conclusion
It is widely known that the projected power envelope for an exascale com-
puter will be extremely limiting. All subsystems are being investigated for
potential inefficiencies, including storage. This chapter presents the widest
survey of storage power use in HPC systems, which shows that today's stor-
age systems are not generally significant users of power. However, extrapolat-
ing potential disk sizes and storage requirements to exascale systems shows
that today's methods of constructing and using storage will be completely
inadequate from a power perspective.
Fortunately, there are several technologies under development that serve to
reduce the demands on I/O systems for bulk synchronous checkpoints. These