A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid - Cloud, Grid and High Performance Computing: Emerging Applications


copies of the modules on more than one node. In case of a node failure, the duplicate copies of the modules continue the job execution. The duplicate copies are used only when a node fails; otherwise the job is executed as per the originally scheduled allocation. The RBS begins its work once the main scheduler has finished allocating the job modules to the various nodes. It is then that the RBS takes control to provide robustness and fault tolerance to the cluster containing the computational resources. The RBS can be used along with any scheduler available in the grid middleware. The inclusion of RBS enables the grid to respond gracefully to node failures at the cost of some grid performance, which is unavoidable since the replicated modules have an altered sequence of execution as compared to the original schedule. The RBS strategy provides an important backup, in the absence of which the job would need to be scheduled afresh, consuming computational energy that proves very costly in a high-traffic environment such as a grid. For real-time jobs the problem becomes much more severe, as failures may impact the grid performance and thus hurt the financial prospects of the grid.
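The failover behaviour described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `Allocation`, `place_replicas`, and `effective_node`, as well as the round-robin replica placement, are assumptions made for the example. The text only states that duplicate copies are kept on more than one node and are used only when a node fails.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Allocation:
    """Primary placement from the main scheduler plus replica placement (hypothetical layout)."""
    primary: Dict[str, str]                                   # module -> node chosen by the main scheduler
    replicas: Dict[str, List[str]] = field(default_factory=dict)  # module -> backup nodes


def place_replicas(primary: Dict[str, str], nodes: List[str], copies: int = 1) -> Allocation:
    """Place `copies` duplicates of each module on nodes other than its primary node.

    Round-robin placement is an assumption used only for illustration.
    """
    if copies >= len(nodes):
        raise ValueError("need more nodes than replica copies")
    alloc = Allocation(primary=dict(primary))
    for module, node in primary.items():
        start = nodes.index(node)
        alloc.replicas[module] = [
            nodes[(start + i) % len(nodes)] for i in range(1, copies + 1)
        ]
    return alloc


def effective_node(alloc: Allocation, module: str, failed: Set[str]) -> Optional[str]:
    """Return the node that should run `module`, given the set of failed nodes.

    The replica is consulted only if the primary node has failed.
    """
    node = alloc.primary[module]
    if node not in failed:
        return node
    for backup in alloc.replicas.get(module, []):
        if backup not in failed:
            return backup
    return None  # no surviving copy: the job would have to be rescheduled


# The main scheduler's allocation is taken as given; replicas are added afterwards.
primary = {"m1": "n1", "m2": "n2", "m3": "n3"}
alloc = place_replicas(primary, ["n1", "n2", "n3"])
```

For instance, `effective_node(alloc, "m1", {"n1"})` selects the backup node holding the copy of `m1`, while with no failures the original allocation is left untouched.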

In the present work, the performance of the RBS has been analyzed by integrating it with a TSM scheduler.

The TSM model considers the grid as a collection of many clusters, each with a specialization and each consisting of a number of nodes for job execution. This is a multipoint-entry grid in which a job can be submitted at any node of the constituent clusters. The main scheduler (TSM) searches for the cluster that matches the job's requirements and offers the minimum turnaround time, and the job is eventually scheduled on that cluster. The job is submitted for execution along with its Job Precedence and Dependence Graph (JPDG), in which the position of each module of the job indicates its order of execution. The JPDG also depicts the degree of parallelism and the interaction dependence of each module on the preceding modules in terms of communication requirements.
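One way to picture a JPDG is as a directed acyclic graph whose edges carry communication requirements. The sketch below is an assumption-laden illustration, not the representation used by the authors: the function names `execution_levels` and `degree_of_parallelism`, the edge-weight units, and the example graph are all hypothetical. It derives each module's order of execution (its level) and the number of modules that could run in parallel at each level.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def execution_levels(modules: List[str], edges: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """Level 0 for modules without predecessors, otherwise 1 + the deepest predecessor."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    levels: Dict[str, int] = {}

    def level(module: str, visiting: Tuple[str, ...] = ()) -> int:
        if module in visiting:
            raise ValueError("JPDG must be acyclic")
        if module not in levels:
            levels[module] = 1 + max(
                (level(p, visiting + (module,)) for p in preds[module]),
                default=-1,
            )
        return levels[module]

    for m in modules:
        level(m)
    return levels


def degree_of_parallelism(levels: Dict[str, int]) -> Dict[int, int]:
    """Number of modules sharing each level, i.e. modules that may run side by side."""
    return dict(Counter(levels.values()))


# Hypothetical job: m1 feeds m2 and m3, which both feed m4.
# Edge values stand for communication requirements in arbitrary units.
modules = ["m1", "m2", "m3", "m4"]
edges = {("m1", "m2"): 5, ("m1", "m3"): 3, ("m2", "m4"): 2, ("m3", "m4"): 4}

levels = execution_levels(modules, edges)
parallelism = degree_of_parallelism(levels)
```

In this example `m2` and `m3` share a level, so the job has a degree of parallelism of two at that stage.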

The allocation status of the various jobs is maintained with each cluster in a data structure known as the Cluster Table (CT), which is updated periodically to reflect the current allocations. The CT consists of the following attributes:

$C_n(S_n, P_k, f_k, \lambda_{lt}, M_{ij}, T_{prkn})$

where $C_n$ refers to the cluster under consideration, with specialization $S_n$, number of nodes $P_k$, clock frequency of each node $f_k$, failure rate of each node $\lambda_{lt}$, modules assigned to the nodes $M_{ij}$, and time to finish the existing modules on the nodes $T_{prkn}$. The CT thus provides information about the cluster constituents: the specialization of the cluster nodes, which helps in allocating jobs to appropriate resources as per their requirements and specifications; the number of nodes in the cluster; their clock frequency; the failure rate of the nodes; the present allocation; and the time needed to finish the modules already allocated on the nodes. The main scheduler in this case is TSM, but it can be any scheduler proposing a scheduling strategy for a modular job. Since the objective of the TSM is to minimize the turnaround

INTEGRATION OF RBS WITH TSM

To analyze the performance of the co-scheduler RBS, it is essential to have a scheduler that schedules the job submitted to the grid on appropriate resources based on a certain optimization parameter. These parameters may vary, e.g. turnaround time, reliability, security, Quality of Service (QoS), etc. Minimizing the turnaround time of the submitted job is often a desired objective and has been addressed in the Turnaround Based Scheduling Model (TSM) for computational grids using a Genetic Algorithm (GA) in [8]. The TSM model uses a GA to schedule a modular job on a cluster-based grid, suggesting an allocation pattern such that the turnaround time of the job is minimized.
