4.1 Introduction
Modern scientific applications and experiments are becoming increasingly data intensive. Large experiments, such as high-energy physics simulations, genome mapping, and climate modeling, generate data volumes reaching hundreds of terabytes [41]. Similarly, remote sensors and satellites are producing extremely large amounts of data for scientists [19, 82]. In order to process these data, scientists are turning toward distributed resources owned by the collaborating parties to provide the computing power and storage capacity needed to push their research forward. But the use of distributed resources imposes new challenges [52]. Even simply sharing and disseminating subsets of the data to the scientists' home institutions is difficult. The systems managing these resources must provide robust scheduling and allocation of storage and networking resources, as well as efficient management of data movement.
One benefit of distributed resources is that they allow institutions and organizations to gain access to resources needed for large-scale applications that they would not otherwise have. But in order to facilitate the sharing of compute, storage, and network resources between collaborating parties, middleware is needed for planning, scheduling, and management of the tasks as well as the resources. Existing research in this area has mainly focused on the management of compute tasks and resources, as they are widely considered to be the most expensive. As scientific applications become more data intensive, however, the management of storage resources and of data movement between the storage and compute resources is becoming the main bottleneck. Many jobs executing in distributed environments fail or are inhibited by overloaded storage servers, and these failures prevent scientists from making progress in their research.
According to the Strategic Plan for the U.S. Climate Change Science Program (CCSP), one of the main objectives of future research programs should be “Enhancing the data management infrastructure,” since “the users should be able to focus their attention on the information content of the data, rather than how to discover, access, and use it” [18]. This statement by CCSP summarizes the goal of many cyberinfrastructure efforts initiated by DOE, NSF, and other federal agencies, as well as the research direction of several leading academic institutions.
Accessing and transferring widely distributed data can be extremely inefficient and can introduce unreliability. For instance, an application may suffer from insufficient storage space when staging in the input data, generating the output, and staging out the generated data to remote storage. This can lead to thrashing of the storage server, timeouts due to too many concurrent read transfers, and ultimately server crashes due to an overload of write transfers. Other third-party data transfers may stall indefinitely due to loss of acknowledgment. Even if a transfer is performed efficiently, faulty hardware involved in staging and hosting can cause
data corruption. Furthermore, remote access will suffer from unforeseeable contingencies.
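To make these failure modes concrete, the sketch below illustrates the kind of defensive logic a staging layer needs: checking available space before staging in, bounding the number of concurrent transfers so a storage server is not overloaded, applying per-transfer timeouts so stalled transfers surface as errors rather than hanging indefinitely, and verifying checksums to detect corruption. This is a minimal illustration in Python, not code from any particular system; the transfer_fn callback, the job tuples, and the constants are assumptions made for the example.

import hashlib
import shutil
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 4   # bound concurrency to avoid overloading the server
TRANSFER_TIMEOUT_SECS = 600    # surface transfers that stall indefinitely

def has_enough_space(dest_dir: str, required_bytes: int) -> bool:
    """Check free space before staging in, instead of failing mid-transfer."""
    return shutil.disk_usage(dest_dir).free >= required_bytes

def sha256_of(path: str) -> str:
    """Compute a checksum so corruption introduced in transit can be detected."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def stage_in(transfer_fn, src: str, dst: str, expected_sha256: str) -> None:
    """Run one transfer and verify its integrity afterwards."""
    transfer_fn(src, dst)  # e.g., a GridFTP or HTTP copy supplied by the caller
    if sha256_of(dst) != expected_sha256:
        raise IOError(f"checksum mismatch after staging {src} -> {dst}")

def stage_all(transfer_fn, jobs, dest_dir: str, total_bytes: int) -> None:
    """Stage a batch of (src, dst, checksum) jobs with bounded concurrency."""
    if not has_enough_space(dest_dir, total_bytes):
        raise IOError("insufficient storage space for staging")
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
        futures = [pool.submit(stage_in, transfer_fn, s, d, h) for s, d, h in jobs]
        for fut in futures:
            # raises TimeoutError on a stalled transfer (the worker thread may
            # linger, but the stall is reported instead of blocking forever)
            fut.result(timeout=TRANSFER_TIMEOUT_SECS)

Bounding concurrency with a fixed-size worker pool is the simplest way to keep a storage server below the load at which it begins thrashing; production schedulers typically tune this limit dynamically based on observed server performance.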