the risk of incorrect operation due to transients which explore these corner
cases grows. Recent experience justifies concerns about these risks, which are
already showing up as root causes of intermittent errors in large-scale
machines.
One area of research with great potential impact is to break the reliance
on tightly coupled applications that cannot handle faults in any software or
hardware component they use. This fragility is sometimes described as local
failure causing global failure and restart. Instead, approaches that allow
portions of a calculation to fail while other portions continue (perhaps at
reduced accuracy) become important. Much more research remains to be done in
this area, as the most widely used parallel programming paradigm, MPI, does
not readily facilitate this style of recovery. Furthermore, application
developers will need to be trained to design algorithms for this new style of
computation.
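The idea of letting some portions fail while the rest continue can be
illustrated with a minimal sketch. It is not an MPI program: the names
PortionFailed, run_portion, and tolerant_mean are hypothetical, and a lost
node is simulated by a set of failed portion indices. The result is reported
together with the fraction of portions that survived, so the caller can see
that it was computed at reduced accuracy.

```python
class PortionFailed(Exception):
    """Signals that one portion of the calculation was lost."""


def run_portion(index, chunk, failed):
    # `failed` is a hypothetical set of portion indices whose node is down.
    if index in failed:
        raise PortionFailed(index)
    return sum(chunk)


def tolerant_mean(chunks, failed=frozenset()):
    """Mean over all surviving portions, plus the fraction that survived."""
    total, count, survived = 0.0, 0, 0
    for index, chunk in enumerate(chunks):
        try:
            partial = run_portion(index, chunk, failed)
        except PortionFailed:
            continue  # the local failure stays local; other portions go on
        total += partial
        count += len(chunk)
        survived += 1
    if count == 0:
        raise RuntimeError("every portion failed; nothing to report")
    return total / count, survived / len(chunks)


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(tolerant_mean(chunks))              # all portions available
print(tolerant_mean(chunks, failed={2}))  # last portion lost
```

With the last portion lost, the estimate is computed from the remaining two
thirds of the data instead of forcing a global restart.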
The notion of localizing failures is closely related to the concept of
containment domains (CDs) [4]. CDs are essentially a form of transactional
computing brought to HPC programming. In this programming model, users
"contain" regions of an application by describing different failure domains.
Then, an advanced compiler and/or runtime system can perform many
reliability-related tasks transparently for the user, such as voting for
correctness and rollback. CDs seem to show promising results and are likely
to appear in some form in future programming paradigms that target reliable
computation.
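The transactional flavor of this model can be sketched in a few lines. The
code below is an illustration only, not the CD interface from [4]:
run_contained, ContainmentError, and the check callback are hypothetical
names, and the correctness check is a simple invariant standing in for
whatever detection (such as voting) a real runtime would apply. The body runs
on a private copy of the state; the copy is committed if the check passes and
discarded (rolled back) otherwise, and an unrecoverable failure is escalated
to the enclosing domain.

```python
import copy


class ContainmentError(Exception):
    """Raised when a domain cannot recover locally and must escalate."""


def run_contained(state, body, check, max_retries=2):
    """Run body(state) inside a hypothetical containment domain.

    Returns the number of retries that were needed before the check passed.
    """
    for attempt in range(max_retries + 1):
        trial = copy.deepcopy(state)  # preserve: the original stays untouched
        try:
            body(trial)
        except Exception:
            continue                  # a crash in the body is a detected fault
        if check(trial):
            state.clear()
            state.update(trial)       # commit the verified result
            return attempt
        # check failed: drop `trial` (rollback) and re-execute the body
    raise ContainmentError("retries exhausted; escalate to enclosing domain")


# Demo: a body that suffers a simulated transient fault on its first run.
calls = {"n": 0}


def flaky_double(s):
    calls["n"] += 1
    s["x"] *= 2
    if calls["n"] == 1:
        s["x"] += 1                   # simulated silent corruption


state = {"x": 21}
# Invariant used as the check: doubling an integer must give an even number.
retries = run_contained(state, flaky_double, check=lambda s: s["x"] % 2 == 0)
print(state, retries)
```

Because the body only ever touches the copy, a failed attempt leaves the
caller's state exactly as it was, which is what makes re-execution safe.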
31.3 Conclusion
HPC resilience is a problem that is growing in importance and recognition
with the size of the HPC systems themselves. Today, system interruptions are
a nuisance that can be addressed (at non-negligible cost) through defensive
checkpointing. Experts in government, industry, and academia believe the rate
of failures is increasing to the point that in the near future system failure will
no longer be the exception. As such, checkpointing is unlikely to be the only
way to address application reliability on future systems.
Furthermore, future systems are likely to become less reliable with respect
to application correctness. The HPC community will see more emphasis put
on user applications that can check a calculation for correctness. They will
also see new programming techniques built around defining reliable and
unreliable portions of an application and accompanying hardware that can take
advantage of multiple levels of fidelity.
Resilience represents a great opportunity to develop applications that
are agile and can adapt to unreliable hardware and software components.
There will be more opportunities available to develop hardware and software
 