2.5 Future Trends and Challenges
High-performance computing systems continue to grow in computational capability and number of processors, and this trend shows no signs of changing. Parallel storage systems must adapt to provide the necessary storage facilities to these ever larger systems. In this section we discuss some of the challenges and technologies that will affect the design and implementation of parallel data storage in the coming years.
2.5.1 Disk Failures
With petascale computers now arriving, there is a pressing need to anticipate and compensate for a probable increase in failure and application-interruption rates, and in the performance degradation caused by online failure recovery. Researchers, designers, and integrators have generally had too little detailed information available on the failures and interruptions that even smaller terascale computers experience. The available information suggests that failure recovery will become far more common in the coming decade, and that recovering online from a storage device failure may become so routine as to change the way we design and measure system performance.
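To see why, consider how the system-level mean time to interrupt (MTTI) scales with component count. The sketch below is a back-of-the-envelope illustration only; the 5-year per-node MTBF and the node counts are assumed values, not figures from any measured dataset.

```python
# Back-of-the-envelope sketch: if node failures are independent, the
# system-level mean time to interrupt (MTTI) shrinks linearly with
# node count. All numbers here are illustrative assumptions.
HOURS_PER_YEAR = 365 * 24
NODE_MTBF_YEARS = 5  # assumed per-node MTBF, not a measured value

for num_nodes in (1_000, 10_000, 100_000):
    mtti_hours = NODE_MTBF_YEARS * HOURS_PER_YEAR / num_nodes
    print(f"{num_nodes:>7} nodes -> system MTTI ~ {mtti_hours:5.1f} hours")
```

Under these assumptions a hundred-thousand-node machine is interrupted roughly every half hour, which is why online recovery ceases to be an exceptional event and starts to shape system design.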
The SciDAC Petascale Data Storage Institute (PDSI, www.pdsi-scidac.org) has collected and analyzed a number of large datasets on failures in high-performance computing (HPC) systems [28]. The primary dataset was collected during 1995-2005 at Los Alamos National Laboratory (LANL) and covers 22 HPC systems, comprising a total of 4,750 machines and 24,101 processors. The data covers node outages in HPC clusters as well as failures in storage systems. This may be the largest failure dataset studied in the literature to date, in terms of both the time period it spans and the number of systems and processors it covers. It is also the first to be publicly available to researchers (see [29] for access to the raw data).
These datasets, together with the scaling trends and assumptions commonly applied to future system designs, have been used to project failure rates, mean time to application interruption, and the resulting application utilization of the full machine for the potential systems of the next decade, assuming checkpoint/restart fault tolerance and the balanced-system design practice of matching storage bandwidth and memory size to aggregate computing power [30]. If the growth in aggregate computing power continues to outstrip the growth in per-chip computing power, an increasing share of the computer's resources will be spent on conventional fault-recovery methods; highly parallel simulation applications may be denied as much as half of the system's resources within five years, for example. New research on application fault-tolerance schemes for these applications should therefore be pursued.
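The shape of such a projection can be sketched concretely. The following is a minimal first-order model under assumed parameters, not the analysis of Reference 30: it combines the MTTI scaling above with the Young/Daly optimal checkpoint interval, t_opt = sqrt(2 * delta * MTTI), where delta is the time to write one checkpoint.

```python
import math

HOURS_PER_YEAR = 365 * 24

def system_mtti_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """System-level MTTI, assuming independent node failures."""
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes

def utilization(ckpt_hours: float, mtti_hours: float) -> float:
    """First-order application utilization under checkpoint/restart.

    Uses the Young/Daly interval t_opt = sqrt(2 * delta * MTTI); the
    fraction of time lost is roughly delta/t_opt + t_opt/(2 * MTTI).
    Clamped at zero where the checkpoint cost approaches the MTTI and
    the first-order approximation breaks down.
    """
    t_opt = math.sqrt(2 * ckpt_hours * mtti_hours)
    waste = ckpt_hours / t_opt + t_opt / (2 * mtti_hours)
    return max(0.0, 1.0 - waste)

# Assumed values for illustration: 5-year node MTBF, 30-minute checkpoint.
for nodes in (1_000, 10_000, 100_000):
    mtti = system_mtti_hours(5, nodes)
    print(f"{nodes:>7} nodes: MTTI = {mtti:6.2f} h, "
          f"utilization ~ {utilization(0.5, mtti):.0%}")
```

With these assumed numbers, utilization falls from roughly 85% at a thousand nodes to about half at ten thousand, echoing the projection that applications could forfeit half the machine; at the largest scale the first-order model collapses entirely, which is exactly the regime that motivates new fault-tolerance approaches.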