Resilience - High Performance Parallel I/O

the risk of incorrect operation due to transients which explore these corner
cases grows. Recent experience justifies concerns about these risks, which are
already showing up as root causes of intermittent errors in large-scale
machines.

One area of research with great potential impact is breaking the reliance
on tightly coupled applications that cannot handle faults in any software or
hardware component they use. This fragility is sometimes described as local
failure causing global failure and restart. Instead, approaches that allow
portions of a calculation to fail while other portions continue (perhaps at
reduced accuracy) become important. Certainly, much more research remains to
be done in this area, as the most widely used parallel programming paradigm,
MPI, does not readily facilitate this. Furthermore, application developers
will need to be trained to design algorithms for this new style of
computation.
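
As a minimal sketch of letting part of a calculation fail while the rest
continues, assume the work splits into independent chunks. The Python below
sums partial results and, when a chunk fails, substitutes an estimate derived
from the surviving chunks instead of restarting everything. The names
ChunkFailure, partial_sum, and resilient_sum are hypothetical and chosen only
for illustration; this does not describe how MPI or any particular runtime
behaves.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class ChunkFailure(Exception):
    """Stands in for a fault in one component (hypothetical)."""


def partial_sum(chunk: Sequence[float]) -> float:
    """Default worker: the exact sum of one chunk."""
    return sum(chunk)


def resilient_sum(
    chunks: List[Sequence[float]],
    worker: Callable[[Sequence[float]], float] = partial_sum,
) -> Tuple[float, List[int]]:
    """Sum all chunks, tolerating failed chunks instead of aborting.

    Returns the (possibly approximate) total and the indices of failed chunks.
    """
    results: Dict[int, float] = {}
    failed: List[int] = []
    for i, chunk in enumerate(chunks):
        try:
            results[i] = worker(chunk)
        except ChunkFailure:
            failed.append(i)  # the local failure stays local

    if not results:
        raise RuntimeError("every chunk failed; nothing to estimate from")

    exact_part = sum(results.values())
    # Reduced accuracy: fill in each lost chunk using the mean value of the
    # elements that were summed successfully.
    survived_items = sum(len(chunks[i]) for i in results)
    mean_value = exact_part / survived_items if survived_items else 0.0
    estimate = sum(mean_value * len(chunks[i]) for i in failed)
    return exact_part + estimate, failed
```

The caller receives both the total and the list of failed chunks, so it can
decide whether the reduced accuracy is acceptable or whether only the failed
chunks should be recomputed.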

The notion of localizing failures is closely related to the concept of
containment domains (CDs) [4]. CDs are essentially a form of transactional
computing brought to HPC programming. In this programming model, users
"contain" regions of an application by describing different failure domains.
Then, an advanced compiler and/or runtime system can perform many
reliability-related tasks transparently for the user, such as voting for
correctness and rollback. CDs seem to show promising results and are likely to
appear in some form in future programming paradigms that target reliable
computation.
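
The transactional flavor of this model can be illustrated with a small
sketch. It is not the CD interface from [4]; run_contained, DetectedError, and
the dictionary-based state are assumptions made for illustration. The function
preserves the state on entry, executes the contained region, checks the
result, and rolls back and re-executes on failure. Voting is not shown.

```python
import copy
from typing import Any, Callable, Dict


class DetectedError(Exception):
    """Raised when a domain cannot produce a result that passes its check."""


def run_contained(
    state: Dict[str, Any],
    body: Callable[[Dict[str, Any]], None],
    check: Callable[[Dict[str, Any]], bool],
    max_retries: int = 3,
) -> Dict[str, Any]:
    """Run `body` as one failure domain: preserve, execute, detect, recover."""
    preserved = copy.deepcopy(state)      # preserve the inputs on entry
    for _ in range(max_retries + 1):
        try:
            body(state)                   # execute the contained region
            if check(state):              # detect errors before committing
                return state
        except Exception:
            pass                          # treat a crash like a failed check
        state.clear()                     # roll back to the preserved copy
        state.update(copy.deepcopy(preserved))
    # Recovery inside this domain failed; a real system would escalate
    # to an enclosing domain at this point.
    raise DetectedError("domain failed after retries")
```

A fault inside the region costs only the re-execution of that region, and the
rest of the application never observes the corrupted intermediate state.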

31.3 Conclusion

HPC resilience is a problem that is growing in importance and recognition
with the size of the HPC systems themselves. Today, system interruptions are
a nuisance that can be addressed (at non-negligible cost) through defensive
checkpointing. Experts in government, industry, and academia believe the rate
of failures is increasing to the point that, in the near future, system
failure will no longer be the exception. As such, checkpointing is unlikely to
be the only way to address application reliability on future systems.
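
To give a rough sense of why checkpointing alone becomes costly as failures
grow more frequent, the sketch below uses Young's first-order approximation
for the checkpoint interval, which is a standard model brought in here for
illustration and not something this chapter derives. The function names and
the sample numbers (a 10-minute checkpoint, mean time between failures of 24
hours versus 1 hour) are assumptions, and the approximation only holds when
the checkpoint cost is small compared with the time between failures.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal time between checkpoints.

    Both arguments use the same time unit; assumes checkpoint_cost << mtbf.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of run time lost at the optimal interval.

    First term: time spent writing checkpoints.
    Second term: expected rework, about half an interval per failure.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    return checkpoint_cost / tau + tau / (2.0 * mtbf)


if __name__ == "__main__":
    for mtbf_minutes in (24 * 60, 60):
        lost = overhead_fraction(10.0, mtbf_minutes)
        print(f"MTBF {mtbf_minutes:5d} min -> about {lost:.0%} of time lost")
```

Under these assumed numbers the estimated loss rises from roughly 12% to
roughly 58% as the mean time between failures drops from a day to an hour,
which is the regime where the approximation itself starts to break down.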

Furthermore, future systems are likely to become less reliable with respect
to application correctness. The HPC community will see more emphasis put
on user applications that can check a calculation for correctness. They will
also see new programming techniques built around defining reliable and
unreliable portions of an application and accompanying hardware that can take
advantage of multiple levels of fidelity.
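
One way an application can check a calculation for correctness is a checksum
test in the style of algorithm-based fault tolerance. The sketch below is an
illustration under assumptions, not a method prescribed by this chapter: the
names matvec and checksum_ok and the tolerance are hypothetical. It verifies a
matrix-vector product y = A x by comparing the sum of the entries of y with
the dot product of the column sums of A and x, which must agree up to rounding
if no error occurred.

```python
from typing import List, Sequence


def matvec(a: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Plain matrix-vector product y = A x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def checksum_ok(
    a: Sequence[Sequence[float]],
    x: Sequence[float],
    y: Sequence[float],
    tol: float = 1e-9,
) -> bool:
    """Return True if y is consistent with A x according to a sum checksum.

    sum_i y_i must equal sum_j (sum_i A_ij) * x_j, up to rounding error.
    """
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))
```

The check costs one extra pass over the matrix, far less than recomputing the
product, but it detects only errors that change the sum; errors that happen to
cancel go unnoticed.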

Resilience represents a great opportunity to develop applications that
are agile and adapt to unreliable hardware and software components.
There will be more opportunities available to develop hardware and software
