A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid - Cloud, Grid and High Performance Computing: Emerging Applications


elaborating the need and integration of RBS with a main scheduler. The working of the model is then illustrated using a suitable example, along with the details of the results obtained from the simulation study. The chapter concludes by detailing the achievements and drawbacks of the work.

Kuppuswam & Ragupathi, 2006). Many more similar models are also available in the literature.

THE REPLICA BASED CO-SCHEDULER (RBS)

Replication can be applied to grid constituents in many ways to induce fault tolerance in the system. Depending on the requirements and availability, it can be used at the hardware or the software level. These techniques work well irrespective of the allocation strategy used by the scheduler, but at an increased cost of execution, in terms of both computational power and money. The degree and type of replication introduced thus depend on the number of failures the system can tolerate. Since the grid is a heterogeneous environment, failures may occur at many levels: the job may fail at the time of submission, the computational resource may fail while the job is being scheduled or even after it has been scheduled, and the network links may fail while the job is interacting with the user or within itself. Among all these failures, those attributable to resources or applications that fail before scheduling do not have a serious effect, as the jobs can be taken up again for scheduling. The problem is serious when resources fail while executing jobs. The most disastrous failure is the failure of the node on which the job is executing. Robustness against application and network failures is difficult to attain, but node failure can be handled somewhat more easily if information is available about the modules (jobs) allocated to that node.
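As a minimal sketch of why allocation information makes node failure easier to handle, the snippet below keeps a map from each node to the modules placed on it. When a node fails, the affected modules can be looked up directly and handed back for scheduling. The names `build_allocation` and `release_failed_node`, and the (module, node) input format, are hypothetical and are not taken from the chapter.

```python
from collections import defaultdict


def build_allocation(assignments):
    """Map each node to the list of modules allocated to it.

    `assignments` is an iterable of (module, node) pairs (an assumed format).
    """
    allocation = defaultdict(list)
    for module, node in assignments:
        allocation[node].append(module)
    return allocation


def release_failed_node(allocation, failed_node):
    """Drop a failed node and return the modules that must be rescheduled."""
    return allocation.pop(failed_node, [])


assignments = [("m1", "n1"), ("m2", "n2"), ("m3", "n1")]
allocation = build_allocation(assignments)
to_reschedule = release_failed_node(allocation, "n1")
print(to_reschedule)  # ['m1', 'm3']
```

Without such a map, the scheduler would have to scan every job to discover which modules were lost with the node.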

The proposed Replica Based Co-Scheduler (RBS) helps in the reliable execution of a modular job by replicating the modules allocated to nodes with high failure rates (sick nodes) onto nodes with lower failure rates (healthy nodes). Each module is reallocated only once, to a node selected at random from among all the healthy nodes. This results in having duplicate

RELATED MODELS

The grid is an aggregation of geographically distributed heterogeneous resources, so unreliable behavior extends from the computational resources to the running applications to the network media. Reliable and fault tolerant scheduling has gained considerable attention from researchers, and many models addressing these issues have been reported in the literature. A few models counter the effect of these failures with solutions ranging from proactive to reactive. A reliability analysis of grid computing systems is given in (Dai, Xie, & Poh, 2002). An agent oriented fault tolerant framework proposed in (Huda, Schmidt & Peake, 2005) uses agents to monitor the system so that, in case of any threat, appropriate measures can be taken beforehand to prevent failures. As a reactive measure, a checkpoint-based mechanism has been adopted for recovery from failures from the last saved state (Mujumdar, Bheevgade, Malik & Patrikar, 2008). Introducing redundancy is a popular means of safeguarding the application, as reported in many models in the literature. A study of the tradeoff between performance and availability has been carried out, suggesting a file replication strategy (Zhang & Honeyman, 2008). The use of replication, by determining the number of replicas required and then suggesting a scheduling strategy for the submitted tasks, is reported in (Li & Mascagni, 2003). Another fault tolerant strategy using replication is proposed in (Liu, Wu, Ma, & Cai, 2008), whereas a model using a database centric approach for static workload in a data grid has been proposed in (Desprez & Vernois, 2007; Sathya,
