Porting HPC Applications to Grids and Clouds - Cloud, Grid and High Performance Computing: Emerging Applications

data sources. Therefore, a simplified interface for the user (a Grid job or other client) requires that the essential information for a request not include the data source itself. Instead, a discovery service determines the relevant data source and access method.
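
The role of such a discovery service can be illustrated with a minimal sketch. It is not taken from the source: the class names, the logical data set name, the host, and the access method below are all hypothetical, and a real Grid would use a catalog service rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Where a logical data set lives and how to reach it (hypothetical values)."""
    host: str
    access_method: str  # for example "gridftp", "sql", or "http"
    path: str


class DiscoveryService:
    """Toy in-memory catalog mapping logical data set names to data sources."""

    def __init__(self) -> None:
        self._catalog: dict[str, DataSource] = {}

    def register(self, logical_name: str, source: DataSource) -> None:
        self._catalog[logical_name] = source

    def resolve(self, logical_name: str) -> DataSource:
        try:
            return self._catalog[logical_name]
        except KeyError:
            raise LookupError(f"no data source registered for {logical_name!r}") from None


def build_request(service: DiscoveryService, logical_name: str) -> str:
    """The caller names only the data set; source and access method are looked up."""
    source = service.resolve(logical_name)
    return f"{source.access_method}://{source.host}{source.path}"


service = DiscoveryService()
service.register(
    "climate/run-42",
    DataSource(host="storage.example.org", access_method="gridftp", path="/data/run42.nc"),
)
print(build_request(service, "climate/run-42"))
```

The job refers to the data only by its logical name, so the data can move or gain a new access method without any change to the request.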

Data Topology

Issues about the size of the data, network bandwidth, and time sensitivity of data determine the location of data for a Grid application. The total amount of data within the Grid application may exceed the amount of its input and output data, as there can be a series of sub-jobs that produce data for other sub-jobs. For permanent storage, the Grid user needs to be able to locate where the required storage space is available in the Grid. Temporary data sets that may need to be copied from or to the client also need to be considered.

The amount of data that can be transported over the network is restricted by the available bandwidth. Limited bandwidth requires careful planning of the data traffic among the distributed components of a Grid application at runtime. Compression and decompression techniques are useful for reducing the amount of data to be transported over the network. In turn, however, they raise the issue of having consistent techniques on all involved nodes. This may exclude the use of scavenging for a Grid if no agreed standards are universally available.

Another issue in this context is time-sensitive data. Some data may have a certain lifetime, meaning its values are only valid during a defined time period. The jobs in a Grid application have to reflect this in order to operate on valid data when executing. Especially when using data caching or other replication techniques, it must be ensured that the data used by the jobs is up to date at any given point in time. The order of data processing by the individual jobs, especially the production of input data for subsequent jobs, has to be carefully observed.

Depending on the job, Jacob et al. (2003) recommend considering the following data-related questions, which refer to the input as well as the output data of the jobs within the Grid application:

• Is it reasonable that each job or set of jobs accesses the data via the network?

• Does it make sense to transport a job or set of jobs to the data location?

• Is there any data access server (for example, implemented as a federated database) that allows access by a job locally or remotely via the network?

• Are there time constraints for data transport over the network, for example, to avoid busy hours and transport the data to the jobs in a batch job during off-peak hours?

• Is there a caching system available on the network to be exploited for serving the same data to several consuming jobs?

• Is the data only available in a unique location for access, or are there replicas that are closer to the executable within the Grid?
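
The questions about caches, replicas, and data that is only valid for a limited time can be combined in a small sketch. It is an illustration under stated assumptions, not a method from the source: the `Replica` fields, the site names, and the rule "prefer the highest-bandwidth replica that is still valid" are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    site: str
    bandwidth_mbit_s: float  # assumed usable bandwidth to the execution node
    refreshed_at: float      # time (in seconds) at which this copy was last refreshed
    lifetime_s: float        # how long the copy stays valid after refreshed_at


def pick_replica(replicas: Sequence[Replica], now: float) -> Optional[Replica]:
    """Return the best-connected replica whose data is still valid, or None."""
    valid = [r for r in replicas if now - r.refreshed_at <= r.lifetime_s]
    if not valid:
        return None  # every copy has expired; the data must be regenerated or refetched
    return max(valid, key=lambda r: r.bandwidth_mbit_s)


replicas = [
    Replica("origin.example.org", bandwidth_mbit_s=100, refreshed_at=0, lifetime_s=3600),
    Replica("cache.example.org", bandwidth_mbit_s=1000, refreshed_at=0, lifetime_s=600),
]
for now in (300, 900, 4000):
    choice = pick_replica(replicas, now)
    print(now, choice.site if choice else "no valid replica")
```

With these made-up numbers the fast cache is chosen while its copy is fresh, the slower origin is chosen once the cached copy has expired, and no replica is returned once both lifetimes have passed.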

Data Volume

The ability of a Grid job to access the data it needs affects the performance of the application. When the data involved is large, or is a subset of a very large data set, moving the data set to the execution node is not always feasible. Considerations as to what is feasible include the volume of the data to be handled, the bandwidth of the network, and the logical interdependencies on the data between multiple jobs.
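
A rough feasibility check can be sketched as a back-of-the-envelope estimate of transfer time from data volume and bandwidth. The function names, the compression ratio, the fixed codec overhead, and all numbers below are illustrative assumptions; the model ignores latency, protocol overhead, and contention.

```python
def transfer_time_s(
    size_bytes: float,
    bandwidth_bit_s: float,
    compression_ratio: float = 1.0,
    codec_overhead_s: float = 0.0,
) -> float:
    """Estimate the seconds needed to move size_bytes over a link.

    compression_ratio is original size divided by compressed size (1.0 = none);
    codec_overhead_s is the assumed total time to compress and decompress.
    """
    if bandwidth_bit_s <= 0 or compression_ratio <= 0:
        raise ValueError("bandwidth and compression ratio must be positive")
    return size_bytes / compression_ratio * 8 / bandwidth_bit_s + codec_overhead_s


def cheaper_to_move_job(data_bytes: float, job_bytes: float, bandwidth_bit_s: float) -> bool:
    """True if shipping the job to the data takes less time than shipping the data."""
    return transfer_time_s(job_bytes, bandwidth_bit_s) < transfer_time_s(data_bytes, bandwidth_bit_s)


data_bytes = 500e9  # 500 GB data set (assumed)
link = 1e9          # 1 Gbit/s link (assumed)

print(transfer_time_s(data_bytes, link))
print(transfer_time_s(data_bytes, link, compression_ratio=4.0, codec_overhead_s=300.0))
print(cheaper_to_move_job(data_bytes, job_bytes=50e6, bandwidth_bit_s=link))
```

Under these assumptions the uncompressed transfer takes 4000 seconds, 4:1 compression with 300 seconds of codec time brings it to 1300 seconds, and shipping a 50 MB executable to the data is far cheaper than either.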

Data volume issues: A Grid application requires transparent access to its input and output data. In most cases the relevant data is permanently stored at remote locations, and the jobs are likely to process local copies. This access to the data incurs a network cost, which must be carefully quantified. Data volume and network
