2.5 Future Trends and Challenges
High-performance computing systems continue to grow in computational capability and number of processors, and this trend shows no signs of changing. Parallel storage systems must adapt to provide the necessary storage facilities to these ever larger systems. In this section we discuss some of the challenges and technologies that will affect the design and implementation of parallel data storage in the coming years.
2.5.1 Disk Failures
With petascale computers now arriving, there is a pressing need to anticipate and compensate for a probable increase in failure and application-interruption rates, and in the performance degradation caused by online failure recovery. Researchers, designers, and integrators have generally had too little detailed information available on the failures and interruptions that even smaller terascale computers experience. The available information suggests that failure recovery will become far more common in the coming decade, and that recovering online from a storage device failure may become so routine as to change the way we design and measure system performance.
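To see why, consider how the system-level mean time to interrupt (MTTI) scales with component count. The sketch below is a back-of-the-envelope illustration only; the 5-year per-node MTBF and the node counts are assumed values, not figures from any measured dataset.

```python
# Back-of-the-envelope sketch: if node failures are independent, the
# system-level mean time to interrupt (MTTI) shrinks linearly with
# node count. All numbers here are illustrative assumptions.
HOURS_PER_YEAR = 365 * 24
NODE_MTBF_YEARS = 5  # assumed per-node MTBF, not a measured value

for num_nodes in (1_000, 10_000, 100_000):
    mtti_hours = NODE_MTBF_YEARS * HOURS_PER_YEAR / num_nodes
    print(f"{num_nodes:>7} nodes -> system MTTI ~ {mtti_hours:5.1f} hours")
```

Under these assumptions a hundred-thousand-node machine is interrupted roughly every half hour, which is why online recovery ceases to be an exceptional event and starts to shape system design.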
The SciDAC Petascale Data Storage Institute (PDSI, www.pdsi-scidac.org) has collected and analyzed a number of large datasets on failures in high-performance computing (HPC) systems [28]. The primary dataset was collected during 1995-2005 at Los Alamos National Laboratory (LANL) and covers 22 HPC systems, comprising a total of 4,750 machines and 24,101 processors. The data covers node outages in HPC clusters as well as failures in storage systems. This may be the largest failure dataset studied in the literature to date, in terms of both the time period it spans and the number of systems and processors it covers. It is also the first to be publicly available to researchers (see [29] for access to the raw data).
These datasets, together with the scaling trends and assumptions commonly applied to future system designs, have been used to project failure rates, mean time to application interruption, and the resulting application utilization of the full machine for the potential systems of the next decade, assuming checkpoint/restart fault tolerance and the balanced-system design practice of matching storage bandwidth and memory size to aggregate computing power [30]. If the growth in aggregate computing power continues to outstrip the growth in per-chip computing power, an increasing share of the computer's resources will be spent on conventional fault-recovery methods; highly parallel simulation applications may be denied as much as half of the system's resources within five years, for example. New research on application fault-tolerance schemes for these applications should therefore be pursued.
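The shape of such a projection can be sketched concretely. The following is a minimal first-order model under assumed parameters, not the analysis of Reference 30: it combines the MTTI scaling above with the Young/Daly optimal checkpoint interval, t_opt = sqrt(2 * delta * MTTI), where delta is the time to write one checkpoint.

```python
import math

HOURS_PER_YEAR = 365 * 24

def system_mtti_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """System-level MTTI, assuming independent node failures."""
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes

def utilization(ckpt_hours: float, mtti_hours: float) -> float:
    """First-order application utilization under checkpoint/restart.

    Uses the Young/Daly interval t_opt = sqrt(2 * delta * MTTI); the
    fraction of time lost is roughly delta/t_opt + t_opt/(2 * MTTI).
    Clamped at zero where the checkpoint cost approaches the MTTI and
    the first-order approximation breaks down.
    """
    t_opt = math.sqrt(2 * ckpt_hours * mtti_hours)
    waste = ckpt_hours / t_opt + t_opt / (2 * mtti_hours)
    return max(0.0, 1.0 - waste)

# Assumed values for illustration: 5-year node MTBF, 30-minute checkpoint.
for nodes in (1_000, 10_000, 100_000):
    mtti = system_mtti_hours(5, nodes)
    print(f"{nodes:>7} nodes: MTTI = {mtti:6.2f} h, "
          f"utilization ~ {utilization(0.5, mtti):.0%}")
```

With these assumed numbers, utilization falls from roughly 85% at a thousand nodes to about half at ten thousand, echoing the projection that applications could forfeit half the machine; at the largest scale the first-order model collapses entirely, which is exactly the regime that motivates new fault-tolerance approaches.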