4.2 Background
In an effort to achieve reliable and efficient data placement, high-level data
management tools such as the reliable file transfer service (RFT),61 the
lightweight data replicator (LDR),51 and the data replication service (DRS)20
were developed. The main motivation for these tools was to transfer byte
streams reliably by automatically retrying after failures such as dropped
connections, machine reboots, and temporary network outages. Most of these
tools are built on top of GridFTP,2 a secure and reliable data transfer
protocol developed especially for high-bandwidth wide area networks.
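The retry-based reliability that these services provide can be illustrated with a short sketch. The transfer_once function, the URLs, and the retry parameters below are hypothetical placeholders rather than the actual RFT or GridFTP interfaces; the point is only that transient failures are absorbed by waiting and retrying.

```python
import time

def transfer_once(src_url: str, dst_url: str) -> None:
    # Hypothetical single-attempt transfer; a real implementation would
    # invoke a GridFTP client (or similar tool) and raise on failure.
    raise NotImplementedError("plug in a real transfer client here")

def reliable_transfer(src_url: str, dst_url: str,
                      max_retries: int = 5, backoff_s: float = 30.0) -> None:
    """Retry a byte-stream transfer until it succeeds or retries run out.

    Dropped connections, machine reboots, and temporary network outages all
    surface here as exceptions; the loop simply waits and tries again.
    """
    for attempt in range(1, max_retries + 1):
        try:
            transfer_once(src_url, dst_url)
            return                              # success
        except Exception as err:                # transient failure
            print(f"attempt {attempt} failed: {err}")
            if attempt == max_retries:
                raise                           # give up after the last retry
            time.sleep(backoff_s * attempt)     # back off before retrying
```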
Beck et al. introduced logistical networking,10 which performs global
scheduling and optimization of data movement, storage, and computation
based on a model that takes into account all of the network's underlying
physical resources. Systems such as the storage resource broker (SRB)8 and
the storage resource manager (SRM)75 were developed to provide a uniform
interface for connecting to heterogeneous data resources. SRB provides a
single front end that can access a variety of back-end storage systems. SRM
is a standard interface specification that permits multiple implementations
on top of different storage systems. SRMs are discussed in detail in Chapter 3.
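The uniform-interface idea behind SRB and SRM can be sketched as a small abstraction layer. The StorageBackend class and the two back ends below are illustrative assumptions, not the actual SRB or SRM APIs.

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Uniform front end; each back end hides its own storage protocol."""

    @abstractmethod
    def get(self, logical_name: str) -> bytes: ...

    @abstractmethod
    def put(self, logical_name: str, data: bytes) -> None: ...

class PosixBackend(StorageBackend):
    """Back end for an ordinary disk file system."""
    def __init__(self, root: str):
        self.root = root

    def get(self, logical_name: str) -> bytes:
        with open(f"{self.root}/{logical_name}", "rb") as f:
            return f.read()

    def put(self, logical_name: str, data: bytes) -> None:
        with open(f"{self.root}/{logical_name}", "wb") as f:
            f.write(data)

class TapeArchiveBackend(StorageBackend):
    """Back end for an archival store; a real one would stage files to disk."""
    def get(self, logical_name: str) -> bytes:
        raise NotImplementedError("stage from tape, then read")

    def put(self, logical_name: str, data: bytes) -> None:
        raise NotImplementedError("write to disk cache, then migrate to tape")
```

Client code programs against StorageBackend only, so replacing a disk pool with a tape archive does not change the application.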
GFarm65 provides a global parallel file system with online petascale storage.
Its model specifically targets applications in which the data consist
primarily of a set of records or objects that are analyzed independently.
GFarm exploits this access locality to achieve scalable I/O bandwidth using
a parallel file system integrated with process scheduling and file distribution.
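A minimal sketch of this kind of locality-aware scheduling is shown below; the fragment-to-node map and the plan format are hypothetical, intended only to show computation being moved to where the data already reside, not GFarm's actual scheduler.

```python
# Hypothetical mapping from file fragments to the node storing them.
fragment_location = {
    "events.part-000": "node-a",
    "events.part-001": "node-b",
    "events.part-002": "node-a",
}

def schedule_local_analysis(fragments, analyze_cmd):
    """Assign each independent analysis task to the node holding its fragment.

    Because records are analyzed independently, every node reads only local
    data, so aggregate I/O bandwidth grows with the number of nodes.
    """
    plan = {}
    for frag in fragments:
        node = fragment_location[frag]
        plan.setdefault(node, []).append(f"{analyze_cmd} {frag}")
    return plan

print(schedule_local_analysis(fragment_location, "analyze"))
```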
The Open-source Project for a Network Data Access Protocol (OPeNDAP)
provides software that makes local multidimensional array data accessible to
remote locations regardless of the local storage format.66 OPeNDAP is
discussed in detail in Chapter 10.
OceanStore54 aimed to build a global persistent data store that can scale to
billions of users. The basic idea is that any server may create a local replica
of any data object. These local replicas provide faster access and robustness
against network partitions. Both GFarm and OceanStore require creating
several replicas of the same data, but neither addresses the problem of
scheduling data movement when no replica is close to the computation site.
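The scheduling gap noted here can be made concrete with a small sketch: given a replica catalog and pairwise transfer costs (both hypothetical), the placement decision is either to read a nearby replica or to schedule an explicit transfer when none is close enough.

```python
# Hypothetical replica catalog and transfer-cost table.
replica_catalog = {"dataset-A": ["site-1", "site-3"]}
transfer_cost = {("site-1", "site-2"): 10.0, ("site-3", "site-2"): 2.5}

def nearest_replica(obj, compute_site):
    """Return the replica site with the lowest transfer cost, or None."""
    sites = replica_catalog.get(obj, [])
    return min(sites,
               key=lambda s: transfer_cost.get((s, compute_site), float("inf")),
               default=None)

def placement_decision(obj, compute_site, threshold=5.0):
    """Use a close replica if one exists; otherwise schedule data movement."""
    best = nearest_replica(obj, compute_site)
    if best is None or transfer_cost.get((best, compute_site), float("inf")) > threshold:
        return f"schedule transfer of {obj} to {compute_site}"
    return f"read replica of {obj} at {best}"

print(placement_decision("dataset-A", "site-2"))   # replica at site-3 is close enough
```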
Bent et al.12 introduced a new distributed file system, the Batch-Aware Dis-
tributed File System (BADFS), and a modified data-driven batch scheduling
system.11 Their goal was to achieve data-driven batch scheduling by exporting
explicit control of storage decisions from the distributed file system to the
batch scheduler. Using simple data-driven scheduling techniques, they
demonstrated that the new data-driven system can achieve orders-of-magnitude
throughput improvements over both current distributed file systems such as
the Andrew File System (AFS) and traditional CPU-centric batch scheduling
techniques that rely on remote I/O.
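A rough sketch of the data-driven idea follows; the job list and staging steps are hypothetical and not the actual BADFS implementation, but they show the scheduler, rather than the file system, deciding when shared input is staged in and evicted, so that data is moved once per batch instead of once per job over remote I/O.

```python
from collections import defaultdict

# Hypothetical workload: each job declares the input data set it depends on.
jobs = [
    ("job-1", "calib-2008"), ("job-2", "calib-2008"),
    ("job-3", "calib-2009"), ("job-4", "calib-2009"),
]

def data_driven_schedule(jobs):
    """Group jobs by shared input, staging each data set once per batch."""
    batches = defaultdict(list)
    for job_id, dataset in jobs:
        batches[dataset].append(job_id)

    plan = []
    for dataset, batch in batches.items():
        plan.append(f"stage {dataset} to local storage")   # scheduler-controlled staging
        plan.extend(f"run {job_id} against local copy" for job_id in batch)
        plan.append(f"evict {dataset}")                    # free space for the next batch
    return plan

for step in data_driven_schedule(jobs):
    print(step)
```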