with what is known as a deadlock. Avoiding deadlocks is a real challenge in developing
distributed programs, especially as the number of tasks is scaled up. To
build upon the example of tasks A and B, let us assume a larger set of tasks A, B, C,
..., Z. In ensuring mutual exclusion, task A might wait on task B if B is holding a lock
required by A. In turn, task B might wait on task C if C is holding a lock required
by B. The "wait on" sequence can carry on all the way up to task Z: task C might
wait on task D, task D might wait on task E, and so on until task Y, which might
wait on task Z. Such a chain is usually referred to as a transitive closure of the
"wait on" relation. When the chain closes on itself (e.g., task Z in turn waits on
task A), a circular wait is said to arise.
Circular waits normally lead to stark deadlocks that can bring whole distributed
programs/systems to a grinding halt. Lastly, we note that the "wait on" relation
is at the heart of every mutual exclusion mechanism. In particular, no mutual exclusion
protocol can preclude it, no matter how clever the protocol is [36]. In normal scenarios,
a task expects to "wait on" only for a limited (reasonable) amount of time. But what if a
task that is holding a lock/token crashes? This suggests another major challenge that
distributed programs need to address, that is, fault tolerance.
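
To make the circular-wait condition concrete, here is a minimal sketch (the function name, the task names, and the one-outstanding-lock-per-task assumption are all illustrative, not taken from the text) that follows a "wait on" chain and reports whether it closes on itself:

    # Minimal sketch: wait_on[x] = y means task x is waiting on a
    # lock held by task y (one outstanding lock request per task).
    def find_circular_wait(wait_on):
        for start in wait_on:
            seen = set()
            task = start
            while task in wait_on:      # follow the "wait on" chain
                if task in seen:        # the chain closed on itself
                    return True
                seen.add(task)
                task = wait_on[task]
        return False

    # A -> B -> C -> A closes the chain: circular wait, hence deadlock.
    print(find_circular_wait({"A": "B", "B": "C", "C": "A"}))  # True
    # A -> B -> C ends at a task that waits on nobody: no deadlock.
    print(find_circular_wait({"A": "B", "B": "C"}))            # False

In practice, a lock manager can run such a check over its wait-for bookkeeping to detect, rather than prevent, deadlocks.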
1.6.5 Fault Tolerance
A basic feature that distinguishes distributed systems such as the cloud from uniprocessor
systems is the concept of partial failures. Specifically, in a distributed system,
if a node or component fails, the whole system can continue functioning. On the other
hand, if one component (e.g., the RAM) fails in a uniprocessor system, the whole
system fails. A crucial objective in designing distributed systems/programs is to
construct them in such a way that they can automatically tolerate partial failures without
seriously affecting performance. A key technique for masking faults in distributed
systems is hardware redundancy, such as RAID technology [56]. In most
cases, however, distributed programs cannot depend only on the underlying hardware
fault-tolerance techniques of distributed systems. Thus, they usually apply their
own fault-tolerance techniques. Among these techniques is software redundancy.
A common type of software redundancy is task redundancy (also called resiliency or
replication). Task replication is applied as a protection against task failures. In particular,
tasks can be replicated as flat or hierarchical groups, as exemplified in Figure
1.15. In flat groups (see Figure 1.15a), all tasks are identical in the sense that they
all carry out the same work. Eventually, only the result of one task is considered and
the other results are discarded. Obviously, flat groups are symmetrical and preclude
single points of failure (SPOFs). In particular, if one task crashes, the application
stays in business, though the group becomes smaller until it is recovered. However, if for
some applications a decision is to be made (e.g., acquiring a lock), a voting mechanism
might be required. As discussed earlier, voting mechanisms incur implementation
complexity, communication delays, and performance overheads.
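
As a rough illustration of a flat group, the sketch below (a hypothetical helper; the thread-based replicas and the plurality vote are assumptions made for the example) runs the same task on several replicas, tolerates individual crashes, and votes on the result:

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    # Minimal sketch of a flat group: every replica performs the same
    # work; crashed replicas merely shrink the group, and a vote over
    # the surviving results picks the answer that is kept.
    def replicated_call(task, n_replicas=3):
        with ThreadPoolExecutor(max_workers=n_replicas) as pool:
            futures = [pool.submit(task) for _ in range(n_replicas)]
        results = []
        for f in futures:
            try:
                results.append(f.result())
            except Exception:           # a failed replica is discarded
                pass
        if not results:
            raise RuntimeError("all replicas failed")
        value, _votes = Counter(results).most_common(1)[0]
        return value                    # plurality (majority) answer

    print(replicated_call(lambda: 42))  # 42, even if a replica crashed

Note the extra machinery the vote brings; this is exactly the implementation complexity and overhead referred to above.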
A hierarchical group (see Figure 1.15b) usually employs a coordinator task and
designates the rest of the tasks as workers. In this model, when a user request is made,
it is first forwarded to the coordinator, which in turn decides which worker is
best suited to fulfill it. Clearly, hierarchical groups exhibit the opposite properties of
flat ones. In particular, the coordinator is an SPOF and a potential performance bottleneck.
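
To contrast with the flat-group sketch, here is an equally hypothetical illustration of a hierarchical group (the class names and the least-loaded dispatch policy are assumptions for the example): all requests pass through a single coordinator, which is precisely why it is an SPOF:

    # Minimal sketch of a hierarchical group: one coordinator task
    # forwards each user request to the worker it deems best suited
    # (here, simply the least-loaded one).
    class Worker:
        def __init__(self, name):
            self.name, self.load = name, 0
        def run(self, request):
            self.load += 1
            return f"{self.name} handled {request}"

    class Coordinator:                  # the group's SPOF
        def __init__(self, workers):
            self.workers = workers
        def handle(self, request):
            worker = min(self.workers, key=lambda w: w.load)
            return worker.run(request)

    group = Coordinator([Worker("w1"), Worker("w2")])
    print(group.handle("req-1"))        # w1 handled req-1
    print(group.handle("req-2"))        # w2 handled req-2

If the coordinator were lost, no request could reach any worker, whereas in a flat group the surviving replicas would carry on.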