A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid - Cloud, Grid and High Performance Computing: Emerging Applications

INTRODUCTION

Computational resources are scarce and must therefore be used efficiently. Resources may range from specialized computational machines and storage machines to heterogeneous applications. A grid aggregates resources across the world seamlessly and enables their use as, when and wherever desired, rather than each individual group investing heavily in high performance computational resources. In the era of high performance and high throughput computing, the grid has emerged as an efficient means of connecting distributed computers or resources scattered all over the world for the purpose of collaborative computing, thus unifying various heterogeneous resources on a common platform while diminishing the administrative boundaries to provide transparent access to a user. Essentially, being a part of the grid gives a user a virtually unlimited capability to execute and compute any kind of job anywhere, simply by joining it. Therefore, even if the appropriate computational capabilities are not available with the user, the grid helps the job to be executed on the right resources, making it efficient as well as cost effective.

Depending on their use, grids can be classified as Computational grid, Data grid, Sensor grid, Biological grid, etc. A computational grid emphasizes the computing aspect, scheduling the job to the grid resources by exploring the computational requirements of the job and effectively load balancing it. Scheduling can be based on various objectives such as maximizing the reliability of job execution, minimizing the makespan, or maximizing the Quality of Service (QoS) for the job execution (Grid Computing Info centre, 2008; Baker, Buyya, & Laforenza, 2002; Tarricone & Esposito, 2005; Ernemann, Hamscher, & Yahyapour, 2002; Casanova, 2002; Vidyarthi, Sarker, Tripathi & Yang, 2009; Raza & Vidyarthi, 2008, 2009).

Execution of a job on the complex and dynamic grid poses a number of challenges. One of these challenges is to ensure a reliable environment for the job so that it can cope with any kind of failure. Since the grid resources are heterogeneous in behavior and administrative control, introducing fault tolerance into the system is very difficult. In addition, the jobs demanding execution on the grid may themselves be very complex and may take a long time to execute, making them vulnerable to failures. Further, the resources are under user control, so even accidental damage or a forced shutdown may cause the execution to fail. The same is true of network failures. These failures may range from hardware to software to network failures. The fault tolerant techniques can thus vary from proactive to reactive approaches to counter failure at any level (Dai, Xie, & Poh, 2002; Huda, Schmidt & Peake, 2005; Mujumdar, Bheevgade, Malik & Patrikar, 2008). In spite of these measures, the chances of failure cannot be ruled out. The desired objective is to accept these failures and minimize their effect by gracefully degrading the system, with continued job execution at the cost of a compromised overall performance. One of the popular mechanisms to handle failures is to introduce replication. This could be in hardware form or in software form, in which the same application is executed or stored at more than one resource. Therefore, with a slight increase in the execution cost, replication increases the probability of the successful execution of the job, thus making it fault tolerant.

Replication incurs a heavy cost, but this cost can be minimized by adopting selective replication. The selection of nodes or job modules depends on certain parameters that can be decided by the system as per the scheduling requirements. The RBS works by replicating some of the modules allocated on a node with a high failure rate onto nodes with a lower failure rate. It therefore increases the fault tolerance of the system without severely affecting the performance.

This paper has six sections. The next section discusses the related work reported in the literature with a similar objective, followed by a section
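The idea of selective replication can be illustrated with a minimal sketch. It is not the RBS algorithm itself, whose details are not given in this introduction. The sketch assumes that each node has a known probability of failing during the job, that nodes fail independently, and that a module is replicated only when its primary node exceeds a cut-off. The names `Node`, `survival_probability`, `plan_replicas` and `threshold` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    failure_prob: float  # assumed probability that the node fails during the job


def survival_probability(failure_probs):
    """Probability that at least one copy survives, assuming independent failures."""
    all_fail = 1.0
    for p in failure_probs:
        all_fail *= p
    return 1.0 - all_fail


def plan_replicas(allocation, nodes, threshold):
    """Pick a replica node for each module placed on a failure-prone node.

    allocation: dict mapping module name -> primary node name
    nodes:      dict mapping node name -> Node
    threshold:  hypothetical cut-off above which a node counts as failure-prone
    """
    replicas = {}
    for module, primary in allocation.items():
        primary_prob = nodes[primary].failure_prob
        if primary_prob <= threshold:
            continue  # selective replication: modules on reliable nodes get no replica
        # Only nodes that are strictly more reliable than the primary are candidates.
        candidates = [
            n for n in nodes.values()
            if n.name != primary and n.failure_prob < primary_prob
        ]
        if candidates:
            replicas[module] = min(candidates, key=lambda n: n.failure_prob).name
    return replicas


if __name__ == "__main__":
    nodes = {n.name: n for n in (Node("A", 0.30), Node("B", 0.05), Node("C", 0.10))}
    allocation = {"m1": "A", "m2": "B", "m3": "C"}
    replicas = plan_replicas(allocation, nodes, threshold=0.2)
    print(replicas)
    for module, replica in replicas.items():
        primary = allocation[module]
        probs = [nodes[primary].failure_prob, nodes[replica].failure_prob]
        print(module, round(survival_probability(probs), 3))
```

With these illustrative numbers, only the module on node A is replicated, and its chance of completing rises from 0.70 on A alone to about 0.985 with a copy on B, while the modules on the more reliable nodes incur no extra cost.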
