31.1 Present
Application fault tolerance on today's systems is almost exclusively handled through defensive I/O checkpointing. Checkpointing (also discussed in Chapters 19 and 23) is by and large the only reliable (albeit expensive) means of recovering from application interrupts.
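To make the idea concrete, the following is a minimal sketch of the defensive checkpoint/restart pattern, written in Python with a hypothetical file name, interval, and workload chosen purely for illustration: the application periodically serializes its state to stable storage and, after an interrupt, restarts from the most recent checkpoint rather than from the beginning.

    import os
    import pickle
    import time

    CHECKPOINT_FILE = "state.ckpt"   # hypothetical path, chosen for this sketch
    CHECKPOINT_INTERVAL = 3600.0     # seconds between defensive checkpoints
    TOTAL_STEPS = 1_000_000          # stand-in for the application's real workload

    def compute_one_step(state):
        # Stand-in for the application's real computation.
        state["value"] = state.get("value", 0) + 1
        return state

    def load_checkpoint():
        # Resume from the last saved state if one exists; otherwise start fresh.
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0}

    def save_checkpoint(state):
        # Write to a temporary file and rename atomically, so an interrupt
        # during the write cannot corrupt the previous good checkpoint.
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT_FILE)

    state = load_checkpoint()
    last_ckpt = time.time()
    while state["step"] < TOTAL_STEPS:
        state = compute_one_step(state)
        state["step"] += 1
        if time.time() - last_ckpt >= CHECKPOINT_INTERVAL:
            save_checkpoint(state)
            last_ckpt = time.time()
    save_checkpoint(state)   # final checkpoint on successful completion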
For the most part, applications that run on current supercomputers are unable to handle the failure of any component they use. This means that an application running across 10,000 nodes of a supercomputer will fail entirely when a single node experiences an unrecoverable memory error. Imagine if your car broke down every time it hit a pothole! That is what today's applications are like.
Obviously there are exceptions to this extreme brittleness. While there
are certainly interesting and promising research approaches to move the field
beyond this mode of computation, they have yet to gain enough traction to
see widespread deployment. These techniques range from new programming
languages, models, and paradigms to hardware-assisted fault avoidance and
recovery.
HPC systems today are almost exclusively built out of commodity components. Some systems will have small portions that are proprietary: a custom compute node kernel, an ultra-high-speed network, a co-processor, etc. However, by and large, the systems are assembled out of the same components that are used by consumers at home, with slight improvements. One example of such an improvement is server-grade dual in-line memory modules, or DIMMs (with advanced error protection).
Individual system components are designed to be reliable for a consumer; each piece is expected to last around five years. However, when many hundreds of thousands of such parts are assembled into a world-class supercomputer that is required to be stable and to compute accurately, problems arise.
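A back-of-the-envelope calculation shows why. Assuming, purely for illustration, independent failures and a five-year mean time between failures per part, a machine built from 100,000 such parts fails, on average, every five years divided by 100,000, i.e., roughly every half hour:

    # Illustrative only: assumes independent, exponentially distributed failures
    # and round numbers, not measurements from any particular machine.
    part_mtbf_hours = 5 * 365 * 24     # ~43,800 hours per component (about 5 years)
    num_parts = 100_000                # hypothetical component count

    system_mtbf_hours = part_mtbf_hours / num_parts
    print(f"System MTBF: {system_mtbf_hours:.2f} hours "
          f"({system_mtbf_hours * 60:.0f} minutes)")
    # Roughly 0.44 hours, i.e., about 26 minutes between component failures.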
Large-scale supercomputing centers are usually somewhat shy about sharing failure rates on these systems, but on the largest supercomputers of today, the application mean time between failure (AMTBF) is on the order of 8–24 hours. This means that an application running across the entire machine will see an interrupt one to several times a day unless it can ride through failure. As noted above, today's applications largely cannot do so, and hence checkpointing in preparation for this impending failure becomes imperative.
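The failure rate also dictates how aggressively to checkpoint. One widely used rule of thumb (not specific to this chapter) is the Young/Daly first-order approximation, which places the optimal checkpoint interval at roughly the square root of twice the checkpoint cost times the MTBF; the numbers below are illustrative only.

    import math

    # Illustrative numbers only, not measurements from any specific system.
    mtbf_hours = 12.0              # an AMTBF in the 8-24 hour range quoted above
    checkpoint_cost_hours = 0.25   # time to write one checkpoint (15 minutes)

    # Young/Daly first-order approximation: tau_opt = sqrt(2 * cost * MTBF)
    tau_opt = math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)
    print(f"Checkpoint roughly every {tau_opt:.1f} hours")
    # sqrt(2 * 0.25 * 12) = sqrt(6), so about every 2.4 hours.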
There are myriad metrics around supercomputer reliability (MTBF, system MTBF, application MTBF, mean uptime, mean time to repair, etc.). While the HPC resilience community has released definitions of these terms before [1, 10], not everyone is in full agreement. As such, not everyone measures these metrics the same way, and it can be difficult to compare numbers from different data centers.
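As a concrete (and entirely hypothetical) example of how the same failure log can be summarized, the sketch below computes a simple MTBF, mean time to repair, and availability from a list of failure and repair timestamps; the definitions used here are one common choice, not the only one.

    # Hypothetical failure log over a 720-hour (30-day) window, given as
    # (failure_time_h, repair_time_h) pairs. The definitions below are one
    # common convention; as noted above, centers do not all measure alike.
    observation_hours = 720.0
    events = [(100.0, 102.0), (250.0, 251.0), (400.0, 406.0), (650.0, 652.0)]

    downtime = sum(repair - fail for fail, repair in events)
    uptime = observation_hours - downtime
    num_failures = len(events)

    mtbf = uptime / num_failures             # mean time between failures
    mttr = downtime / num_failures           # mean time to repair
    availability = uptime / observation_hours

    print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.2f} h, "
          f"availability: {availability:.3%}")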
 