4.1 Introduction
Modern scientific applications and experiments are becoming increasingly data intensive. Large experiments, such as high-energy physics simulations, genome mapping, and climate modeling, generate data volumes reaching hundreds of terabytes [41]. Similarly, remote sensors and satellites are producing extremely large amounts of data for scientists [19, 82]. In order to process these data, scientists are turning toward distributed resources owned by the collaborating parties to provide the computing power and storage capacity needed to push their research forward. But the use of distributed resources imposes new challenges [52]. Even simply sharing and disseminating subsets of the data to the scientists' home institutions is difficult. The systems managing these resources must provide robust scheduling and allocation of storage and networking resources, as well as efficient management of data movement.
One benefit of distributed resources is that they allow institutions and organizations to gain access to resources needed for large-scale applications that they would not otherwise have. But in order to facilitate the sharing of compute, storage, and network resources between collaborating parties, middleware is needed for planning, scheduling, and management of the tasks as well as the resources. Existing research in this area has mainly focused on the management of compute tasks and resources, as they are widely considered to be the most expensive. As scientific applications become more data intensive, however, the management of storage resources and of data movement between the storage and compute resources is becoming the main bottleneck. Many jobs executing in distributed environments fail or are inhibited by overloaded storage servers, and these failures prevent scientists from making progress in their research.
According to the Strategic Plan for the U.S. Climate Change Science Program (CCSP), one of the main objectives of future research programs should be “Enhancing the data management infrastructure,” since “the users should be able to focus their attention on the information content of the data, rather than how to discover, access, and use it” [18]. This statement by CCSP summarizes the goal of many cyberinfrastructure efforts initiated by DOE, NSF, and other federal agencies, as well as the research direction of several leading academic institutions.
Accessing and transferring widely distributed data can be extremely inefficient and can introduce unreliability. For instance, an application may suffer from insufficient storage space when staging in the input data, generating the output, and staging out the generated data to remote storage. This can lead to thrashing of the storage server, timeouts due to too many concurrent read transfers, and ultimately server crashes due to an overload of write transfers. Other third-party data transfers may stall indefinitely due to loss of acknowledgment. Even if a transfer is performed efficiently, faulty hardware involved in staging and hosting can cause
data corruption. Furthermore, remote access will suffer from unforeseeable contingencies.
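To make these failure modes concrete, the sketch below illustrates the kind of defensive logic a staging layer needs: checking available space before staging in, bounding the number of concurrent transfers so a storage server is not overloaded, applying per-transfer timeouts so stalled transfers surface as errors rather than hanging indefinitely, and verifying checksums to detect corruption. This is a minimal illustration in Python, not code from any particular system; the transfer_fn callback, the job tuples, and the constants are assumptions made for the example.

import hashlib
import shutil
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 4   # bound concurrency to avoid overloading the server
TRANSFER_TIMEOUT_SECS = 600    # surface transfers that stall indefinitely

def has_enough_space(dest_dir: str, required_bytes: int) -> bool:
    """Check free space before staging in, instead of failing mid-transfer."""
    return shutil.disk_usage(dest_dir).free >= required_bytes

def sha256_of(path: str) -> str:
    """Compute a checksum so corruption introduced in transit can be detected."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def stage_in(transfer_fn, src: str, dst: str, expected_sha256: str) -> None:
    """Run one transfer and verify its integrity afterwards."""
    transfer_fn(src, dst)  # e.g., a GridFTP or HTTP copy supplied by the caller
    if sha256_of(dst) != expected_sha256:
        raise IOError(f"checksum mismatch after staging {src} -> {dst}")

def stage_all(transfer_fn, jobs, dest_dir: str, total_bytes: int) -> None:
    """Stage a batch of (src, dst, checksum) jobs with bounded concurrency."""
    if not has_enough_space(dest_dir, total_bytes):
        raise IOError("insufficient storage space for staging")
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
        futures = [pool.submit(stage_in, transfer_fn, s, d, h) for s, d, h in jobs]
        for fut in futures:
            # raises TimeoutError on a stalled transfer (the worker thread may
            # linger, but the stall is reported instead of blocking forever)
            fut.result(timeout=TRANSFER_TIMEOUT_SECS)

Bounding concurrency with a fixed-size worker pool is the simplest way to keep a storage server below the load at which it begins thrashing; production schedulers typically tune this limit dynamically based on observed server performance.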