data sources. Therefore, a simplified interface for
the user (a Grid job or other client) requires that
a request need not name the data source itself.
Instead, a discovery service determines the
relevant data source and access method.
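As a minimal sketch of this idea, the following Python fragment shows a client that names only a logical data set and lets a discovery service fill in the source and access method. The class names, field names, and the example values used with them are hypothetical and chosen for illustration only; they are not taken from any particular Grid toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Where a data set lives and how to reach it (hypothetical fields)."""
    location: str       # for example, a host or site name
    access_method: str  # for example, "gridftp" or "sql"


class DiscoveryService:
    """Maps logical data-set names to concrete data sources."""

    def __init__(self):
        self._registry = {}

    def register(self, logical_name, source):
        self._registry[logical_name] = source

    def resolve(self, logical_name):
        try:
            return self._registry[logical_name]
        except KeyError:
            raise LookupError(
                f"no data source registered for {logical_name!r}"
            ) from None


def build_request(service, logical_name):
    """The client supplies only the logical name; the source is looked up."""
    source = service.resolve(logical_name)
    return {
        "data": logical_name,
        "location": source.location,
        "method": source.access_method,
    }
```

A job would call build_request(service, "some-data-set") without knowing where the data resides. If the data later moves, only the registry entry changes, not the job.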
Depending on the job, Jacob et al. (2003)
recommend considering the following data-related
questions, which apply to the input as well as
the output data of the jobs within the Grid application:
• Is it reasonable that each job or set of jobs
accesses the data via the network?
• Does it make sense to transport a job or set
of jobs to the data location?
• Is there any data access server (for example,
implemented as a federated database) that
allows access by a job locally or remotely
via the network?
• Are there time constraints for data transport
over the network, for example, to avoid busy
hours and transport the data to the jobs in a
batch job during off-peak hours?
• Is there a caching system available on the
network that can be exploited to serve the
same data to several consuming jobs?
• Is the data available only in a single location,
or are there replicas that are closer to the
executable within the grid?
Data Topology
The size of the data, the network bandwidth, and
the time sensitivity of the data determine the
location of data for a Grid application. The total
amount of data within the Grid application may
exceed the amount of data input and output of
the Grid application, as there can be a series of
sub-jobs that produce data for other sub-jobs.
For permanent storage, the Grid user needs to be
able to locate where the required storage space is
available in the grid. Temporary data sets that
may need to be copied from or to the client also
need to be considered.
The amount of data that can be transported
over the network is restricted by the available
bandwidth. Limited bandwidth requires careful
planning of the data traffic among the distributed
components of a Grid application at runtime.
Compression and decompression techniques are
useful for reducing the amount of data to be
transported over the network. In turn, however,
they raise the issue of having consistent techniques
on all involved nodes. This may exclude the use of
scavenging for a grid if no agreed standards are
universally available.
Another issue in this context is time-sensitive
data. Some data may have a certain lifetime,
meaning its values are only valid during a defined
time period. The jobs in a Grid application have
to reflect this in order to operate with valid data
when executing. Especially when data caching or
other replication techniques are used, it must be
ensured that the data used by the jobs is up to
date at any given point in time. The order of
data processing by the individual jobs, especially
the production of input data for subsequent jobs,
has to be carefully observed.
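The time-sensitivity concern raised under Data Topology (data that is valid only during a defined period, and cached or replicated copies that must stay current) can be illustrated with a small validity check. This is a sketch under stated assumptions: the names CachedItem, is_valid, and read_for_job are hypothetical, the cache is a plain dictionary, and time is passed in explicitly as seconds rather than read from a clock.

```python
from dataclasses import dataclass


@dataclass
class CachedItem:
    value: object
    produced_at: float  # seconds, on the same clock as `now`
    lifetime: float     # seconds during which the value is valid


def is_valid(item, now):
    """True while `now` falls inside the item's validity window."""
    return item.produced_at <= now < item.produced_at + item.lifetime


def read_for_job(cache, key, now, refetch):
    """Return a valid value, refreshing the cached copy if it has expired.

    `refetch(key, now)` is an assumed callback that obtains a fresh
    CachedItem from the authoritative data source.
    """
    item = cache.get(key)
    if item is None or not is_valid(item, now):
        item = refetch(key, now)
        cache[key] = item
    return item.value
```

Passing the current time as an argument keeps the check deterministic. A real deployment would also have to account for clock differences between nodes, which this sketch ignores.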
Data Volume
The ability of a Grid job to access the data it needs
affects the performance of the application.
When the data involved is large, or is a subset
of a very large data set, moving the data set to
the execution node is not always feasible. The
considerations as to what is feasible include the
volume of the data to be handled, the bandwidth
of the network, and logical interdependences on
the data between multiple jobs.
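A rough estimate can make the feasibility question concrete. The sketch below computes the time needed to ship a given volume over a link of a given bandwidth, optionally with compression, and compares moving the data with moving the job. The function names and the simple cost model are assumptions made for illustration: latency, protocol overhead, compression time, and whether the job can actually run at the data location are all ignored.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s, compression_ratio=1.0):
    """Estimated time to ship `size_bytes` over a link.

    compression_ratio is original size divided by compressed size
    (1.0 means no compression).
    """
    if bandwidth_bits_per_s <= 0 or compression_ratio <= 0:
        raise ValueError("bandwidth and compression ratio must be positive")
    return (size_bytes * 8 / compression_ratio) / bandwidth_bits_per_s


def cheaper_to_move(data_bytes, job_bytes, bandwidth_bits_per_s,
                    compression_ratio=1.0):
    """Return "job" if shipping the job is faster than shipping the data."""
    data_time = transfer_seconds(data_bytes, bandwidth_bits_per_s,
                                 compression_ratio)
    job_time = transfer_seconds(job_bytes, bandwidth_bits_per_s)
    return "job" if job_time < data_time else "data"
```

For example, 1 GB over a 100 Mbit/s link takes about 80 seconds uncompressed under this model, so a job of a few megabytes is far cheaper to move than the data it reads.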
Data volume issues: In a Grid application,
transparent access to its input and output data is
required. In most cases the relevant data is
permanently located at remote locations and the jobs
are likely to process local copies. This access to
the data incurs a network cost that must be
carefully quantified. Data volume and network