31.1.1 Getting the Correct Answer
When people talk about supercomputer reliability they have, historically, almost exclusively spoken of faults that cause detectable errors. Examples include a node crashing due to an unrecoverable memory error, a crash of a software middleware layer or hardware driver, a power glitch that halts a machine, or a file system that is unreachable (for whatever reason) when an application tries to use it. Each of these examples has a detectable signature that causes a change in the application's behavior. In most cases, the application simply fails, and data must be recovered from a recent checkpoint.
These are not the only types of faults on systems. Indeed, there are faults that make undetectable changes to a computation. Like all faults, these can be transient, intermittent, or permanent. Transient faults are usually caused by environmental effects, such as cosmic radiation and alpha particles emitted by radioactive impurities in the electronics packaging. Intermittent faults are often the effect of temperature or voltage extremes and variations. These effects can sometimes grow into permanent failures.
All of these effects can and do cause computing hardware to perform incorrectly. In the case of data storage, particularly in memory (main memory, caches, registers, etc.), most supercomputers today employ advanced error correction codes. These codes can identify and correct a large number of these errors. However, as the amount of memory on supercomputers has grown, so has the required complexity of these error correction codes, to the point that a non-negligible amount of energy is expended to ensure correctness. Still, some data corruption will slip past these codes, and the extreme scale of today's systems makes that increasingly likely. Additionally, the computational logic itself can corrupt results.
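To make the idea of detection and correction concrete, the following sketch implements a textbook Hamming(7,4) code, which protects four data bits with three parity bits and can locate and correct any single flipped bit. This is only a minimal illustration of the principle; the codes used in production ECC memory are wider (e.g., SECDED or chipkill-class codes) and are implemented in the memory controller hardware, and the function names below are purely illustrative.

```python
# Minimal Hamming(7,4) sketch: 4 data bits, 3 parity bits, corrects any
# single flipped bit.  Illustrative only; real ECC memory uses wider
# codes implemented in hardware.

def hamming74_encode(d):
    """d is a list of 4 data bits; returns a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Returns (corrected data bits, 1-based position of the error or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 0 means no error was detected
    if syndrome:
        c[syndrome - 1] ^= 1           # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], syndrome

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                       # inject a single-bit fault
data, position = hamming74_decode(codeword)
assert data == [1, 0, 1, 1] and position == 6
```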
Supercomputers are used in a wide range of application spaces. While those
applications undoubtedly require different levels of precision, it is important
to understand that supercomputers (and indeed, all computers) cannot be
viewed as entirely reliable digital machines. Instead, users must check the
integrity of their calculations in as many ways as possible.
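One broadly applicable way to do this is to verify a cheap mathematical invariant of a result, such as the residual of a linear solve or a conserved physical quantity. The sketch below (written in NumPy, with a tolerance chosen purely for illustration) shows the pattern under those assumptions; the appropriate check and threshold are application specific.

```python
import numpy as np

# One possible end-to-end integrity check: after solving A x = b, verify
# that the scaled residual is small.  The tolerance is illustrative; an
# application would pick one suited to its conditioning and precision.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))
b = rng.standard_normal(1000)

x = np.linalg.solve(A, b)

residual = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
if residual > 1e-10:
    # Recompute, roll back to a checkpoint, or flag the result as suspect.
    raise RuntimeError(f"solution failed integrity check: residual = {residual:.2e}")
```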
One such approach that is gaining in popularity is algorithm-based fault tolerance (ABFT). This technique embraces fundamental algorithmic changes that allow an application to check its results for correctness and recompute corrupted data. Examples are still few and far between, and due to the nature of ABFT there is little that can be generalized across different classes of applications. As such, ABFT may find use in only a portion of supercomputing application fields, while others will require different, more specialized techniques.
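The classic ABFT example, due to Huang and Abraham, encodes dense matrix multiplication with row and column checksums: a single corrupted entry of the product shows up as exactly one failing row check and one failing column check, so it can be located and recomputed at the cost of one extra row and one extra column of work. The sketch below illustrates the idea in NumPy; the matrix sizes, tolerance, and function names are arbitrary choices for this example, and a production implementation would operate on distributed blocks.

```python
import numpy as np

def abft_matmul(A, B):
    """Checksum-encoded product in the style of Huang-Abraham ABFT.

    Appends a column-sum row to A and a row-sum column to B, so the
    (m+1) x (n+1) result carries checksums of C = A @ B.
    """
    A_c = np.vstack([A, A.sum(axis=0, keepdims=True)])
    B_c = np.hstack([B, B.sum(axis=1, keepdims=True)])
    return A_c @ B_c

def abft_check_and_repair(C_full, A, B, tol=1e-6):
    """Locate a single corrupted entry of C via the checksums and recompute it."""
    m, n = A.shape[0], B.shape[1]
    C = C_full[:m, :n]
    bad_rows = np.flatnonzero(np.abs(C.sum(axis=1) - C_full[:m, n]) > tol)
    bad_cols = np.flatnonzero(np.abs(C.sum(axis=0) - C_full[m, :n]) > tol)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i, j] = A[i, :] @ B[:, j]      # recompute only the corrupted entry
    elif len(bad_rows) or len(bad_cols):
        raise RuntimeError("corruption detected but not correctable")
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((200, 300)), rng.standard_normal((300, 100))

C_full = abft_matmul(A, B)
C_full[17, 42] += 1.0                    # inject a silent fault into the product
C = abft_check_and_repair(C_full, A, B)
assert np.allclose(C, A @ B)
```

Because the checksums are carried through the multiplication itself, this kind of check covers faults that occur inside the computation as well as corruption of the stored result.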
Luckily, on current systems, data corruption appears to be rare. However, by the very nature of these faults being undetectable, it is likely that the HPC community does not have an accurate understanding of actual data corruption rates.
The fundamental reason the HPC community is not seeing complex and innovative systems for dynamically adapting to failures is that the problem has not yet become pressing enough to demand them.
 