6.2.4 The Curse of the Burst: Economic Thinking behind Burst Buffers
HPC I/O is currently dominated by extreme bursts of data, in both the
checkpoint workload and analysis workloads. Because I/O time is generally
non-productive time, the goal is to minimize it. As machine memory sizes
grow, from 1 PB today to 50-100 PB on exascale-class machines, the size of
these bursts is becoming enormous. Further, for the checkpoint use case, the
shrinking job mean time to interrupt (JMTTI) shrinks the time available for
each I/O burst. Figure 6.2 shows the effect of the ratio of JMTTI to
checkpoint time. As JMTTI goes down, checkpoint time must also go down
non-linearly in order to keep machine utilization high.
To add to this dilemma of JMTTI and checkpoint times, the disk drives
currently used for checkpoint I/O typically get denser at nearly the square
of the rate at which their streaming bandwidth increases. This means that the number of
FIGURE 6.2: Machine efficiency versus the ratio of JMTTI to checkpoint time.
As JMTTI goes down, checkpoint time must also go down non-linearly in order
to keep machine utilization high. In the Great region, only a small percentage
(about 5%) of the supercomputer's time is spent on checkpoint/restart
(defensive I/O). In the Good region, about 15% or less of the time is spent
on defensive I/O, and in the Bad region more than 15% of the time is spent
on defensive I/O. In conclusion, the efficiency of the supercomputer depends
heavily on the capabilities of its defensive I/O mechanisms. [Image courtesy
of Josip Loncaric (LANL).]
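
The relationship between JMTTI, checkpoint time, and time lost to defensive I/O can be illustrated with a small sketch. It assumes Young's first-order approximation for the checkpoint interval, which is a common textbook model and not necessarily the one behind Figure 6.2. The function names are hypothetical, and the 5% and 15% thresholds are taken from the figure caption.

```python
import math


def defensive_io_fraction(checkpoint_time_s: float, jmtti_s: float) -> float:
    """Estimate the fraction of machine time lost to checkpoint/restart.

    Assumption: Young's first-order model, in which the checkpoint
    interval is sqrt(2 * checkpoint_time * JMTTI). Restart time is ignored.
    """
    if checkpoint_time_s <= 0 or jmtti_s <= 0:
        raise ValueError("checkpoint time and JMTTI must be positive")

    # Checkpoint interval under Young's approximation (an assumption here).
    interval_s = math.sqrt(2.0 * checkpoint_time_s * jmtti_s)

    # Time spent writing checkpoints, plus the expected work redone
    # after an interrupt (on average half an interval per JMTTI).
    write_cost = checkpoint_time_s / interval_s
    rework_cost = interval_s / (2.0 * jmtti_s)

    # The first-order model breaks down for large ratios, so cap at 1.
    return min(write_cost + rework_cost, 1.0)


def classify_region(fraction: float) -> str:
    """Map a defensive I/O fraction to the regions named in the caption."""
    if fraction <= 0.05:
        return "Great"
    if fraction <= 0.15:
        return "Good"
    return "Bad"


if __name__ == "__main__":
    jmtti_s = 24 * 3600  # illustrative value: a one-day JMTTI
    for checkpoint_time_s in (60, 600, 3600):
        fraction = defensive_io_fraction(checkpoint_time_s, jmtti_s)
        print(checkpoint_time_s, round(fraction, 3), classify_region(fraction))
```

Under this model the lost fraction depends only on the ratio of checkpoint time to JMTTI, so a machine stays in the same region only if checkpoint time shrinks along with JMTTI. The sample values are illustrative and do not come from the source.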
 