Chapter 31
Resilience
Gary Grider and Nathan DeBardeleben
Los Alamos National Laboratory
31.1 Present
31.1.1 Getting the Correct Answer
31.2 Future
31.3 Conclusion
Bibliography
"Resilience" has become an overarching term for the reliability of both
software and hardware in the high performance computing field. The term came
about during a philosophical shift away from merely tolerating faults and
toward being able to ride through failure. While there certainly are examples
of resilience, particularly in hardware, there are relatively few examples of
true resilience at higher levels of the software/hardware stack.
For several reasons, resilience is a key challenge for future supercomputing
systems. First, system reliability is intricately tied to the number of
hardware and software components. Reliability has become a bigger problem as
systems scale to petascale and exascale, and the predicted component counts
for future systems will only exacerbate this issue. Second, performance,
power, and reliability are interrelated, and while strict requirements are
being set for performance (i.e., "exaop") and power (i.e., 20 MW), reliability
requirements are somewhat less constrained. Furthermore, decreases in supply
voltage come with a decrease in reliability [9].
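The link between component count and reliability can be made concrete with a small calculation. The sketch below is not taken from the chapter. It assumes a deliberately simple model in which components fail independently at identical, constant rates and any single component failure brings the system down. The function name and the per-node figure are hypothetical and chosen only for illustration.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """Mean time between failures of a system that fails when any component fails.

    Assumption: components are independent and have identical, constant
    failure rates (exponential lifetimes), so their failure rates add.
    """
    if component_count < 1:
        raise ValueError("component_count must be at least 1")
    return component_mtbf_hours / component_count


# Hypothetical figure: each node fails about once every five years.
COMPONENT_MTBF_HOURS = 5 * 365 * 24

if __name__ == "__main__":
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(COMPONENT_MTBF_HOURS, nodes)
        print(f"{nodes:>7} nodes -> system MTBF of {mtbf:.2f} hours")
```

Under these assumptions, the system-level time between failures shrinks in direct proportion to the component count, which is one way to read the concern about future component counts. Real systems have correlated failures and heterogeneous parts, so this should be read as a rough guide rather than a prediction.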
While current supercomputers almost exclusively address reliability through
some form of checkpointing, there is concern that future HPC systems will
need more elaborate tools to achieve resilience. However, rather than forcing
system reliability toward some unachievable goal, there remains hope of moving
from current systems, which merely tolerate faults, to systems that are truly
resilient to failure.
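For readers unfamiliar with checkpointing, the following is a minimal single-process sketch of the checkpoint/restart idea, assuming the application state fits in a small dictionary and is saved to a local file. It is not the mechanism used on any particular supercomputer. The function names, the `fail_at` fault-injection parameter, and the toy workload (summing integers) are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the previous
    checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at (hypothetical) simulates a fault by raising just before
    that step executes.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run(path, 10, 5, fail_at=7)` raises after the checkpoint at step 5 has been written. Calling `run(path, 10, 5)` afterwards resumes from that checkpoint instead of starting over, and the work done in steps 5 and 6 before the fault is simply repeated. This tolerates the fault by rolling back, which is the behavior the text contrasts with riding through failure.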