31.1.1 Getting the Correct Answer
When people talk about supercomputer reliability they have, historically, almost exclusively spoken of faults that cause detectable errors. Examples include a node crashing due to an unrecoverable memory error, a crash of a software middleware layer or hardware driver, a power glitch that halts a machine, or a file system that is unreachable (for whatever reason) when an application tries to use it. Each of these examples has a detectable signature that causes a change in the application's behavior. In most cases, the application simply fails, and data must be recovered from a recent checkpoint.
These are not the only types of faults on systems. Indeed, there are faults that make undetectable changes to a computation. Like all faults, these can be transient, intermittent, or permanent. Transient faults are usually caused by environmental effects, such as cosmic radiation and alpha particles emitted by radioactive impurities in the electronics packaging. Intermittent faults are often the effect of temperature or voltage extremes and variations. These effects can sometimes grow into permanent failures.
All of these effects can and do cause computing hardware to perform incorrectly. In the case of data storage, particularly in memory (main memory, caches, registers, etc.), most supercomputers today employ advanced error correction codes. These codes can identify and correct a large number of these errors. However, as the amount of memory on supercomputers has grown, so has the required complexity of these error correction codes, to the point that a non-negligible amount of energy is expended to ensure correctness. Still, some data corruption will slip past these codes, and the extreme scale of today's systems makes that increasingly likely. Additionally, the computational logic itself can corrupt results.
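To make the idea of detection and correction concrete, the following sketch implements a textbook Hamming(7,4) code, which protects four data bits with three parity bits and can locate and correct any single flipped bit. This is only a minimal illustration of the principle; the codes used in production ECC memory are wider (e.g., SECDED or chipkill-class codes) and are implemented in the memory controller hardware, and the function names below are purely illustrative.

```python
# Minimal Hamming(7,4) sketch: 4 data bits, 3 parity bits, corrects any
# single flipped bit.  Illustrative only; real ECC memory uses wider
# codes implemented in hardware.

def hamming74_encode(d):
    """d is a list of 4 data bits; returns a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Returns (corrected data bits, 1-based position of the error or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 0 means no error was detected
    if syndrome:
        c[syndrome - 1] ^= 1           # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], syndrome

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                       # inject a single-bit fault
data, position = hamming74_decode(codeword)
assert data == [1, 0, 1, 1] and position == 6
```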
Supercomputers are used in a wide range of application spaces. While those
applications undoubtedly require different levels of precision, it is important
to understand that supercomputers (and indeed, all computers) cannot be
viewed as entirely reliable digital machines. Instead, users must check the
integrity of their calculations in as many ways as possible.
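One broadly applicable way to do this is to verify a cheap mathematical invariant of a result, such as the residual of a linear solve or a conserved physical quantity. The sketch below (written in NumPy, with a tolerance chosen purely for illustration) shows the pattern under those assumptions; the appropriate check and threshold are application specific.

```python
import numpy as np

# One possible end-to-end integrity check: after solving A x = b, verify
# that the scaled residual is small.  The tolerance is illustrative; an
# application would pick one suited to its conditioning and precision.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))
b = rng.standard_normal(1000)

x = np.linalg.solve(A, b)

residual = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
if residual > 1e-10:
    # Recompute, roll back to a checkpoint, or flag the result as suspect.
    raise RuntimeError(f"solution failed integrity check: residual = {residual:.2e}")
```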
One such approach that is gaining in popularity is algorithm-based fault tolerance (ABFT). This technique embraces fundamental algorithmic changes that allow an application to check its results for correctness and recompute corrupted data. Examples are still few and far between, and due to the nature of ABFT there is little that can be generalized across different classes of applications. As such, ABFT may find use in only a portion of supercomputing application fields, while others will require different, more specialized techniques.
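The classic ABFT example, due to Huang and Abraham, encodes dense matrix multiplication with row and column checksums: a single corrupted entry of the product shows up as exactly one failing row check and one failing column check, so it can be located and recomputed at the cost of one extra row and one extra column of work. The sketch below illustrates the idea in NumPy; the matrix sizes, tolerance, and function names are arbitrary choices for this example, and a production implementation would operate on distributed blocks.

```python
import numpy as np

def abft_matmul(A, B):
    """Checksum-encoded product in the style of Huang-Abraham ABFT.

    Appends a column-sum row to A and a row-sum column to B, so the
    (m+1) x (n+1) result carries checksums of C = A @ B.
    """
    A_c = np.vstack([A, A.sum(axis=0, keepdims=True)])
    B_c = np.hstack([B, B.sum(axis=1, keepdims=True)])
    return A_c @ B_c

def abft_check_and_repair(C_full, A, B, tol=1e-6):
    """Locate a single corrupted entry of C via the checksums and recompute it."""
    m, n = A.shape[0], B.shape[1]
    C = C_full[:m, :n]
    bad_rows = np.flatnonzero(np.abs(C.sum(axis=1) - C_full[:m, n]) > tol)
    bad_cols = np.flatnonzero(np.abs(C.sum(axis=0) - C_full[m, :n]) > tol)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i, j] = A[i, :] @ B[:, j]      # recompute only the corrupted entry
    elif len(bad_rows) or len(bad_cols):
        raise RuntimeError("corruption detected but not correctable")
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((200, 300)), rng.standard_normal((300, 100))

C_full = abft_matmul(A, B)
C_full[17, 42] += 1.0                    # inject a silent fault into the product
C = abft_check_and_repair(C_full, A, B)
assert np.allclose(C, A @ B)
```

Because the checksums are carried through the multiplication itself, this kind of check covers faults that occur inside the computation as well as corruption of the stored result.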
Luckily, on current systems, data corruption appears to be rare. However, by the very nature of these faults being undetectable, it is likely that the HPC community does not have an accurate understanding of actual data corruption rates.
The fundamental reason the HPC community is not seeing complex and innovative systems for dynamically adapting to failures is that the problem has not yet become pressing enough to demand them.
 