remote data is fetched during job execution. In
the DP strategy, each job is assigned to a node that
stores the job's required input data. Ranganathan
& Foster claim that, in most situations, DP
outperforms LLS and RS, since moving data
across the grid's nodes can be very time consuming.
Several parameters should be considered when
scheduling data-centric jobs. These include the
size of the job's input and output data and the
network bandwidth among the grid's nodes. Park
& Kim (2003) present a cost model that uses such
parameters to estimate a job's execution time at
each node (considering both whether the job is
executed at the submission site or elsewhere, and
whether it uses local or remote data as input).
The job is then scheduled to the node with the
lowest predicted execution time.
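A minimal sketch of this kind of cost model: predicted time at a node is computation time plus the time to transfer remote input over the available bandwidth, and the scheduler picks the minimum. The formula and parameter names here are illustrative assumptions, not Park & Kim's exact model.

```python
def predicted_time(compute_time, input_size, bandwidth):
    """compute_time: seconds of pure computation at the node;
    input_size: MB of input to fetch; bandwidth: MB/s to the data
    source, or None when the input is already stored locally."""
    transfer = 0.0 if bandwidth is None else input_size / bandwidth
    return compute_time + transfer

def schedule(candidates):
    """candidates: dict node -> (compute_time, input_size, bandwidth).
    Returns the node with the lowest predicted execution time."""
    return min(candidates, key=lambda n: predicted_time(*candidates[n]))
```

Note how a fast remote node can beat the node holding the data locally once the transfer time is small enough, which is exactly the trade-off the cost model captures.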
Although very promising, grid-enabled
database management systems were not widely
adopted for a long time (Nieto-Santisteban et al,
2005; Watson, 2001). Watson (2001) proposed the
construction of a federated system using
ODBC/JDBC as an interface to heterogeneous
database systems. In more recent work, web services
are used as an interface to database management
systems. Alpdemir et al (2003) present an Open
Grid Services Architecture (OGSA; Foster et al,
2002)-compatible implementation of a distributed
query processor (Polar*), in which a distributed
query execution plan is built from basic operations
that are executed at several nodes.
Costa & Furtado (2008c) compare the use
of centralized and hierarchical query scheduling
strategies in grid-enabled databases. The
authors show that hierarchical schedulers can
be used without significant loss in system
performance and can also achieve good levels of
compliance with Service Level Objectives (SLOs).
In Costa & Furtado (2008b), the authors propose
the use of reputation systems to schedule deadline-
marked queries among grid-enabled databases
when several replicas of the same data are present
at distinct sites.
In Grids, data replicas are commonly used to
improve job (or query) execution performance
and data availability. Best Client and Cascading
Replication are among the dynamic file replication
strategies evaluated by Ranganathan & Foster
(2001) for use in the Grid. In both models, a
new file replica is created whenever the number
of accesses to an existing data file exceeds a
threshold value. The difference between the
methods lies in where the new replica is placed.
The 'best client' of a certain data file is the node
that has requested it the most times in a certain
time period. In the Best Client placement strategy,
the new replica is placed at the best client node.
In the Cascading Replication method, the new
file is placed at the first node in the path between
the node that stores the file being replicated and
the best client node.
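The two placement rules can be sketched together: count accesses in the observation window, and once the busiest client crosses the threshold, place the replica either at that client (Best Client) or at the first hop toward it (Cascading). The threshold value and all names below are illustrative assumptions.

```python
from collections import Counter

THRESHOLD = 3  # illustrative access-count threshold

def choose_placement(access_log, path_to_best_client, strategy):
    """access_log: list of node names that requested the file in the
    current time window. path_to_best_client: nodes on the path from
    the file's current holder to the best client (holder excluded).
    Returns the node where a new replica is created, or None when
    the access count is still below the threshold."""
    best_client, hits = Counter(access_log).most_common(1)[0]
    if hits <= THRESHOLD:
        return None
    if strategy == "best_client":
        return best_client
    if strategy == "cascading":
        return path_to_best_client[0]  # first node toward the best client
    raise ValueError(f"unknown strategy: {strategy}")
```

Cascading thus places replicas progressively down the path, so intermediate nodes can also serve other clients in the same subtree, while Best Client optimizes only for the single heaviest requester.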
The Best Client strategy served as inspiration
for the Best Replica Site strategy (Siva
Sathya et al, 2006). The main difference between
this strategy and the original Best Client is
that in Best Replica Site the site at which the
replica is created is chosen considering not only
the number of accesses from clients to the dataset,
but also the replica's expected utility for each site
and the distance between sites. Siva Sathya et al (2006)
also propose two other strategies: Cost Effective
Replication and Topology Based Replication. In
the first, a cost function is used to choose the
site at which a replica should be created (the cost
function evaluates the cost of accessing a replica
at each site). In the latter, database replicas are
created at the node that has the greatest number
of direct connections to other nodes.
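The Topology Based rule reduces to choosing the node of highest degree in the grid's connection graph. A minimal sketch, assuming the topology is given as an adjacency mapping (the representation is an assumption, not from the original paper):

```python
def topology_based_site(adjacency):
    """adjacency: dict node -> set of directly connected nodes.
    Returns the node with the most direct connections (highest
    degree), i.e. the replica site under Topology Based Replication."""
    return max(adjacency, key=lambda node: len(adjacency[node]))
```

Placing the replica at a high-degree node keeps it one hop away from the largest number of neighbors, which is the intuition behind the strategy.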
Topology-related aspects are also considered
by Lin et al (2006) in order to choose replica
locations. The authors consider a hierarchical (tree-like)
grid in which the database is placed at the tree root.
Whenever a job is submitted, the scheduler looks
for the accessed data at the node where the job
was submitted. If the necessary data is not at that
node, the scheduler asks for it at the node's
parent node. If the parent node does not have a