31.1 Present
Application fault tolerance on today's systems is almost exclusively handled through defensive I/O checkpointing. Checkpointing (also discussed in Chapters 19 and 23) is by and large the only reliable (albeit expensive) means of recovering from application interrupts.
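To make the idea concrete, the following is a minimal sketch of the defensive checkpoint/restart pattern, written in Python with a hypothetical file name, interval, and workload chosen purely for illustration: the application periodically serializes its state to stable storage and, after an interrupt, restarts from the most recent checkpoint rather than from the beginning.

    import os
    import pickle
    import time

    CHECKPOINT_FILE = "state.ckpt"   # hypothetical path, chosen for this sketch
    CHECKPOINT_INTERVAL = 3600.0     # seconds between defensive checkpoints
    TOTAL_STEPS = 1_000_000          # stand-in for the application's real workload

    def compute_one_step(state):
        # Stand-in for the application's real computation.
        state["value"] = state.get("value", 0) + 1
        return state

    def load_checkpoint():
        # Resume from the last saved state if one exists; otherwise start fresh.
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0}

    def save_checkpoint(state):
        # Write to a temporary file and rename atomically, so an interrupt
        # during the write cannot corrupt the previous good checkpoint.
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT_FILE)

    state = load_checkpoint()
    last_ckpt = time.time()
    while state["step"] < TOTAL_STEPS:
        state = compute_one_step(state)
        state["step"] += 1
        if time.time() - last_ckpt >= CHECKPOINT_INTERVAL:
            save_checkpoint(state)
            last_ckpt = time.time()
    save_checkpoint(state)   # final checkpoint on successful completion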
For the most part, applications that run on current supercomputers are unable to handle the failure of any component they use. This means that an application running across 10,000 nodes of a supercomputer will fail entirely when a single node experiences an unrecoverable memory error. Imagine if your car broke down every time it hit a pothole! That is what today's applications are like.
Obviously there are exceptions to this extreme brittleness. While there
are certainly interesting and promising research approaches to move the field
beyond this mode of computation, they have yet to gain enough traction to
see widespread deployment. These techniques range from new programming
languages, models, and paradigms to hardware-assisted fault avoidance and
recovery.
HPC systems today are almost exclusively built out of commodity components. Some systems will have small portions that are proprietary: a custom compute node kernel, an ultra-high-speed network, a co-processor, etc. However, by and large, the systems are assembled out of the same components that are used by consumers at home, with slight improvements. One example of such an improvement is server-grade dual in-line memory modules, or DIMMs (with advanced error protection).
Individual system components are designed to be reliable for a consumer; each piece is expected to last around five years. However, when many hundreds of thousands of such parts are assembled into a world-class supercomputer that is required to be stable and to compute accurately, problems arise.
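A back-of-the-envelope calculation shows why. Assuming, purely for illustration, independent failures and a five-year mean time between failures per part, a machine built from 100,000 such parts fails, on average, every five years divided by 100,000, i.e., roughly every half hour:

    # Illustrative only: assumes independent, exponentially distributed failures
    # and round numbers, not measurements from any particular machine.
    part_mtbf_hours = 5 * 365 * 24     # ~43,800 hours per component (about 5 years)
    num_parts = 100_000                # hypothetical component count

    system_mtbf_hours = part_mtbf_hours / num_parts
    print(f"System MTBF: {system_mtbf_hours:.2f} hours "
          f"({system_mtbf_hours * 60:.0f} minutes)")
    # Roughly 0.44 hours, i.e., about 26 minutes between component failures.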
Large-scale supercomputing centers are usually somewhat shy about sharing failure rates on these systems, but on the largest supercomputers of today, the application mean time between failure (AMTBF) is on the order of 8–24 hours. This means that an application running across the entire machine will see an interrupt one to several times a day unless it can ride through failure. As noted above, today's applications largely cannot do so, and hence checkpointing in preparation for this impending failure becomes imperative.
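The failure rate also dictates how aggressively to checkpoint. One widely used rule of thumb (not specific to this chapter) is the Young/Daly first-order approximation, which places the optimal checkpoint interval at roughly the square root of twice the checkpoint cost times the MTBF; the numbers below are illustrative only.

    import math

    # Illustrative numbers only, not measurements from any specific system.
    mtbf_hours = 12.0              # an AMTBF in the 8-24 hour range quoted above
    checkpoint_cost_hours = 0.25   # time to write one checkpoint (15 minutes)

    # Young/Daly first-order approximation: tau_opt = sqrt(2 * cost * MTBF)
    tau_opt = math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)
    print(f"Checkpoint roughly every {tau_opt:.1f} hours")
    # sqrt(2 * 0.25 * 12) = sqrt(6), so about every 2.4 hours.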
There are myriad metrics around supercomputer reliability (MTBF, system MTBF, application MTBF, mean uptime, mean time to repair, etc.). While the HPC resilience community has released definitions of these terms before [1, 10], not everyone is in full agreement. As such, not everyone measures these metrics the same way, and it can be difficult to compare numbers from different data centers.
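As a concrete (and entirely hypothetical) example of how the same failure log can be summarized, the sketch below computes a simple MTBF, mean time to repair, and availability from a list of failure and repair timestamps; the definitions used here are one common choice, not the only one.

    # Hypothetical failure log over a 720-hour (30-day) window, given as
    # (failure_time_h, repair_time_h) pairs. The definitions below are one
    # common convention; as noted above, centers do not all measure alike.
    observation_hours = 720.0
    events = [(100.0, 102.0), (250.0, 251.0), (400.0, 406.0), (650.0, 652.0)]

    downtime = sum(repair - fail for fail, repair in events)
    uptime = observation_hours - downtime
    num_failures = len(events)

    mtbf = uptime / num_failures             # mean time between failures
    mttr = downtime / num_failures           # mean time to repair
    availability = uptime / observation_hours

    print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.2f} h, "
          f"availability: {availability:.3%}")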
 