Resilience - High Performance Parallel I/O

Chapter 31

Resilience

Gary Grider and Nathan DeBardeleben

Los Alamos National Laboratory

31.1 Present
31.1.1 Getting the Correct Answer
31.2 Future
31.3 Conclusion
Bibliography

The term "resilience" has become an overarching label for the reliability of both software and hardware in the high performance computing field. Resilience came about during a philosophical shift away from merely tolerating faults and toward being able to ride through failure. While there certainly are examples of resilience, particularly in hardware, there are relatively few examples of true resilience at higher levels of the software/hardware stack.

For several reasons, resilience is a key challenge for future supercomputing systems. First, system reliability is intricately tied to the scale of hardware and software components. Reliability has become a bigger problem as systems scale to petascale and exascale, and the predicted component counts for systems of the future will only exacerbate this issue. Second, performance, power, and reliability are interrelated, and while strict requirements are being set for performance (i.e., "exaop") and power (i.e., 20 MW), reliability requirements are somewhat less constrained. Furthermore, decreases in supply voltage come with a decrease in reliability [9].
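The link between component count and system reliability can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the chapter. It assumes that the system fails whenever any single component fails, and that components fail independently at the same constant rate, in which case the system mean time between failures (MTBF) is the component MTBF divided by the component count. The function name and all numeric values are hypothetical.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """Estimate the MTBF of a system that fails when any one component fails.

    Assumption (not from the chapter): components fail independently with
    identical, constant failure rates. Real machines violate both conditions,
    so treat the result as a rough trend, not a prediction.
    """
    if component_count < 1:
        raise ValueError("component_count must be at least 1")
    return component_mtbf_hours / component_count


# Hypothetical per-component MTBF, chosen only for illustration.
COMPONENT_MTBF_HOURS = 1_000_000.0

for count in (1_000, 10_000, 100_000):
    print(count, system_mtbf_hours(COMPONENT_MTBF_HOURS, count))
```

Under these assumptions, a hundredfold increase in component count shortens the expected time between system failures a hundredfold, which is why growing component counts put pressure on reliability even when individual parts do not get worse.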

While current supercomputers almost exclusively address reliability through some form of checkpointing, there is concern that future HPC systems will need more elaborate tools to achieve resilience. However, rather than forcing system reliability toward some unachievable goal, there remains hope of moving from current systems, which still only tolerate faults, to systems that are truly resilient to failure.
