has not, to date, been large enough to warrant serious investment in resilience,
although this may be changing in the near future.
31.2 Future
Several reports [2, 7, 12] highlight the increasing need for resilient, high-performance computing systems as HPC continues to strive for larger and larger supercomputer deployments. Different groups of leaders in the HPC resilience field have issued reports [6, 5, 8, 11, 3] on the challenges, opportunities, and suggested approaches for fielding a reliable supercomputer in the exascale timeframe.
As component counts of future systems continue to grow to staggering numbers, so do the reliability concerns. Leadership-class supercomputers in the 2020 timeframe are likely to contain between 32 and 100 petabytes of main memory, a 100× to 350× increase over 2012 levels [2]. If implemented as DRAM DIMMs, the sheer number of DIMMs on a machine of this scale will make failure rates (due to both hard and soft faults) extremely challenging. Similar increases are expected in the amount of cache memory (SRAM) on future systems.
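To make the scale of the concern concrete, the back-of-the-envelope sketch below estimates the DIMM count for such a machine and the resulting aggregate fault rate. The DIMM capacity and per-DIMM FIT rate (failures per 10^9 device-hours) are illustrative assumptions, not figures from the cited reports:

# Back-of-the-envelope estimate of DIMM count and aggregate fault rate.
# All constants are illustrative assumptions for this sketch only.

PETABYTE = 2**50                       # bytes

main_memory_bytes = 64 * PETABYTE      # mid-range of the 32-100 PB projection
dimm_capacity_bytes = 32 * 2**30       # assume 32 GB DIMMs
fit_per_dimm = 50                      # assumed failures per 10^9 device-hours

num_dimms = main_memory_bytes // dimm_capacity_bytes
system_fit = num_dimms * fit_per_dimm  # aggregate FIT across all DIMMs
mtbf_hours = 1e9 / system_fit          # mean time between DIMM faults

print(f"DIMMs: {num_dimms:,}")                                 # ~2.1 million
print(f"MTBF from DIMM faults alone: {mtbf_hours:.1f} hours")  # ~9.5 hours

Even under these modest assumptions, memory faults alone arrive every few hours, and every other component class on the machine adds to that rate.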
Several studies [2, 12] suggest that, without aggressive engineering, crude projections show an exascale supercomputer suffering a failure every 10–50 minutes. Clearly, saving a checkpoint that consumes much of the supercomputer's main memory for an application running at scale is a serious challenge within a window of only tens of minutes. Engineering approaches are certainly possible; however, they are likely to be costly and unlikely to align well with commodity desktop or server demands. Therefore, there is great potential for innovative approaches to resilience in this timeframe, so that the HPC community is not forced to pay for expensive fault-hardened hardware.
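The tension between checkpoint cost and failure rate can be quantified with the classic first-order approximation for the optimal checkpoint interval, often attributed to Young: tau = sqrt(2 C M), where C is the time to write one checkpoint and M is the mean time between failures. The numbers in the sketch below are illustrative assumptions, not figures from the cited studies:

import math

mtbf_minutes = 30.0          # assume a failure every 30 minutes (10-50 range)
checkpoint_minutes = 15.0    # assumed time to dump much of main memory

tau = math.sqrt(2.0 * checkpoint_minutes * mtbf_minutes)    # optimal interval
overhead = checkpoint_minutes / tau    # first-order fraction of time lost

print(f"Optimal checkpoint interval: {tau:.1f} minutes")    # 30.0 minutes
print(f"Checkpoint overhead: {overhead:.0%} of runtime")    # 50%

For these assumed values, the model predicts roughly half the machine's time going to checkpointing, and the approximation itself is only valid when the checkpoint time is much smaller than the MTBF, which here it no longer is. Both observations illustrate why plain checkpoint/restart does not scale into this regime.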
Figure 31.1 shows a prediction of how different types of errors will increase as process technology scales [5]. In particular, the figure shows that the rate of soft-error-related faults in logic latches is predicted to rise. Although not included in the figure, there is also a great deal of uncertainty about aging effects in new process technologies.
Additionally, as the industry continues to drive down power usage by moving to near-threshold voltage operation, power-related error rates are also expected to rise. Figure 31.2 shows that, historically, as process technology has scaled downward, each new generation has become less susceptible to soft errors even when voltage was also scaled downward [9]. Even though this trend is promising, soft-error rates do show some sensitivity to voltage scaling, and that sensitivity is likely to persist.
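One way to see why near-threshold operation is worrisome is the widely cited Hazucha-Svensson empirical model, in which the soft-error rate depends exponentially on the critical charge Qcrit, and Qcrit scales roughly with node capacitance times supply voltage. The sketch below uses made-up constants purely to illustrate the shape of that dependence; none of the values are measured data:

import math

def relative_ser(vdd_volts, c_node_fF=1.0, qs_fC=0.3):
    """Relative soft-error rate at a given supply voltage (arbitrary units)."""
    qcrit_fC = c_node_fF * vdd_volts       # Qcrit ~ C * Vdd (fF * V = fC)
    return math.exp(-qcrit_fC / qs_fC)     # exponential sensitivity to Qcrit

nominal = relative_ser(0.9)                # assumed nominal supply voltage
near_threshold = relative_ser(0.5)         # assumed near-threshold supply
print(f"SER increase at near-threshold: {near_threshold / nominal:.1f}x")

For these made-up constants, the per-bit soft-error rate rises by roughly 3.8x when the supply drops from 0.9 V to 0.5 V; the exponential dependence is why even a modest voltage reduction can multiply error rates.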
To ensure correct operation, all circuits must operate within their specified environmental envelope at all times, over the operational life of the machine.
 