A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid - Cloud, Grid and High Performance Computing: Emerging Applications


$E_{ijkn} = I_i \cdot (1/f_k) + n \cdot \alpha$

(iii)

Here, $ClusRel_{jn}$ as stated in eq. (iii) is the reliability offered to the job $J_j$ without node failure, and ${}^{K}C_{I}$ accounts for the failure of $I$ nodes out of the $K$ available nodes on which the original allocation has been made.

$x_{ijk}$ is the vector indicating the assignment of module $m_i$ of job $J_j$ on node $P_k$. It assumes a binary value: 1 if the module is allocated to the node and 0 otherwise. $T_{prkn}$ is the time to finish execution of the present modules on the node $P_k$.

The factor $w \sum_{h=1}^{i-1} (B_{ihj} D_{kl})\, x_{ijk} x_{hjl}$ represents the communication cost between module $m_i$ and the previous modules $m_h$ as per the JPDG, where $B_{ihj}$ is the number of bytes that need to be exchanged between modules $m_i$ and $m_h$, and $D_{kl}$ is the Hamming distance between the nodes $P_k$ and $P_l$ involved in the data exchange. $w$ is the scaling factor that converts the term $\sum_{h=1}^{i-1} (B_{ihj} D_{kl})\, x_{ijk} x_{hjl}$ into time units.

RBS Algorithm

The TSM essentially schedules the job on the cluster offering the minimum turnaround time from a group of clusters whose specialization matches that of the job. Once the cluster is selected for job allocation, its Cluster Table (CT) is updated to accommodate the new job. The job of the RBS begins where the job of the TSM finishes. For the selected cluster, the RBS evaluates the vulnerability of the nodes on which an allocation has been made by comparing their failure rates $\lambda_{lt}$ with a threshold failure rate $\lambda_{th}$, which depends on the domain knowledge of the cluster along with the acceptance level of failures. Accordingly, the nodes are judged as healthy or sick. For the sick nodes, the CT is consulted to check for any allocations made. These modules are then duplicated on a healthy node, selected randomly. The algorithm for the same is shown in the box.

Now if a failure is detected, the system does not fail completely, as copies of the modules on the failed node are still available on some other nodes. The execution of the job still follows the JPG, with the penalty of an increase in the turnaround time. It is due to some nodes waiting for the pre-
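As an illustration only, the replication step described for the RBS (classify nodes against $\lambda_{th}$, look up the CT for modules on sick nodes, duplicate them on a randomly chosen healthy node) could be sketched in Python as follows. The function name, the dictionary layout used for the Cluster Table, and the rule that a node counts as sick when its failure rate strictly exceeds the threshold are assumptions made for this sketch; the text does not specify them.

```python
import random


def replicate_sick_modules(failure_rates, cluster_table, lambda_th, rng=None):
    """Hypothetical sketch of the RBS replication step.

    failure_rates: {node: failure rate} for the nodes of the selected cluster
    cluster_table: {node: [modules allocated on that node]} (assumed CT layout)
    lambda_th:     threshold failure rate
    Returns {module: node chosen to hold its replica}.
    """
    if rng is None:
        rng = random.Random()

    # Assumption: a node is "sick" when its rate strictly exceeds the threshold.
    healthy = [n for n, lam in failure_rates.items() if lam <= lambda_th]
    sick = [n for n, lam in failure_rates.items() if lam > lambda_th]

    replicas = {}
    if not healthy:
        # No healthy node is available, so nothing can be replicated.
        return replicas

    for node in sick:
        # Consult the CT for modules already allocated on the sick node.
        for module in cluster_table.get(node, []):
            # Duplicate the module on a randomly selected healthy node.
            replicas[module] = rng.choice(healthy)
    return replicas
```

Calling `replicate_sick_modules({"P1": 0.9, "P2": 0.1}, {"P1": ["m1"]}, 0.5)` maps `m1` to `P2`, the only healthy node in that toy cluster. Passing a seeded `random.Random` makes the random selection repeatable.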

The reliability offered by the cluster of the grid, $ClusRel_{jn}$, as per the allocation pattern suggested by the chromosome can be written as shown in Box 1, where $ModRel_{ik}$ is the reliability offered by the grid when module $m_i$ has been assigned on node $P_k$. Introduction of replicated modules increases the reliability of the job execution. At any time, the reliability offered to the job with replication, $ClusRelRep_{jn}$, can be written as

ClusRelRep = ClusRel

C * ClusRel

(v)

Box 1.

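$ClusRel_{jn} = \prod_{i=1} ModRel_{ik}$ (iii)

$ClusRel_{jn} = \prod_{i=1} \exp\left[-\left((\mu_{ij} + \lambda_{kn})\, E_{ijkn} \cdot x_{ijk} + (\mu_{ij} + \xi_{kln}) \sum_{h=1}^{i-1} w (B_{ihj} \cdot D_{kl})\, x_{ijk} \cdot x_{hjl} + T_{prkn}\right)\right]$ (iv)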
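The communication-cost term $w \sum_{h=1}^{i-1} (B_{ihj} D_{kl})\, x_{ijk} x_{hjl}$ used in the text can be illustrated with a small Python sketch. It rests on several assumptions that are not stated in the text: nodes are identified by integer addresses so that the Hamming distance is the number of differing bits, modules are indexed from 0, a single job is considered (the index $j$ is dropped), and the assignment indicators $x_{ijk}$ and $x_{hjl}$ are replaced by a list `node_of` giving the node on which each module is placed. All names are hypothetical.

```python
def hamming_distance(k: int, l: int) -> int:
    """Number of bit positions in which node addresses k and l differ."""
    return bin(k ^ l).count("1")


def communication_cost(i, node_of, bytes_exchanged, w):
    """Scaled communication cost of module i with its previous modules.

    node_of:         node_of[m] is the node address hosting module m;
                     this plays the role of the indicators x_ijk and x_hjl
    bytes_exchanged: {(i, h): bytes exchanged between modules i and h};
                     pairs that are absent exchange nothing
    w:               scaling factor converting the sum into time units
    """
    k = node_of[i]
    total = 0
    for h in range(i):  # previous modules h = 0 .. i-1 (0-based indexing)
        b = bytes_exchanged.get((i, h), 0)
        total += b * hamming_distance(k, node_of[h])
    return w * total


# Module 2 on node 0b101 talks to module 0 (node 0b000) and module 1 (node 0b011).
cost = communication_cost(2, [0b000, 0b011, 0b101], {(2, 0): 100, (2, 1): 50}, w=0.5)
```

In this toy case both Hamming distances are 2, so the sum is 100·2 + 50·2 = 300 and `cost` is 150.0. Modules placed on the same node contribute nothing, since their distance is 0.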
