with what is known as a deadlock. Avoiding deadlocks is a real challenge in developing
distributed programs, especially as the number of tasks is scaled up. To
build upon the example of tasks A and B, let us assume a larger set of tasks A, B, C,
..., Z. In ensuring mutual exclusion, task A might wait on task B if B is holding a lock
required by A. In turn, task B might wait on task C if C is holding a lock required
by B. The "wait on" sequence can carry on all the way up to task Z: task C might
wait on task D, task D might wait on task E, and so on until task Y, which might
wait on task Z. Such a chain is usually referred to as a transitive closure of the
"wait on" relation. When the chain closes on itself (e.g., task Z in turn waits on
task A), a circular wait is said to arise.
Circular waits normally lead to stark deadlocks that can bring whole distributed
programs/systems to a grinding halt. Lastly, we note that the "wait on" relation
is at the heart of every mutual exclusion mechanism. In particular, no mutual exclusion
protocol can preclude it, no matter how clever the protocol is [36]. In normal scenarios,
a task expects to "wait on" only for a limited (reasonable) amount of time. But what if a
task that is holding a lock/token crashes? This suggests another major challenge that
distributed programs need to address, that is, fault tolerance.
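
To make the circular-wait condition concrete, here is a minimal sketch (the function name, the task names, and the one-outstanding-lock-per-task assumption are all illustrative, not taken from the text) that follows a "wait on" chain and reports whether it closes on itself:

    # Minimal sketch: wait_on[x] = y means task x is waiting on a
    # lock held by task y (one outstanding lock request per task).
    def find_circular_wait(wait_on):
        for start in wait_on:
            seen = set()
            task = start
            while task in wait_on:      # follow the "wait on" chain
                if task in seen:        # the chain closed on itself
                    return True
                seen.add(task)
                task = wait_on[task]
        return False

    # A -> B -> C -> A closes the chain: circular wait, hence deadlock.
    print(find_circular_wait({"A": "B", "B": "C", "C": "A"}))  # True
    # A -> B -> C ends at a task that waits on nobody: no deadlock.
    print(find_circular_wait({"A": "B", "B": "C"}))            # False

In practice, a lock manager can run such a check over its wait-for bookkeeping to detect, rather than prevent, deadlocks.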
1.6.5 Fault Tolerance
A basic feature that distinguishes distributed systems such as the cloud from uniprocessor
systems is the concept of partial failures. Specifically, in a distributed system,
if a node or component fails, the whole system can continue functioning. On the other
hand, if one component (e.g., the RAM) fails in a uniprocessor system, the whole
system fails. A crucial objective in designing distributed systems/programs is to
construct them in such a way that they can automatically tolerate partial failures without
seriously affecting performance. A key technique for masking faults in distributed
systems is hardware redundancy, such as RAID technology [56]. In most
cases, however, distributed programs cannot depend only on the underlying hardware
fault-tolerance techniques of distributed systems. Thus, they usually apply their
own fault-tolerance techniques. Among these techniques is software redundancy.
A common type of software redundancy is task redundancy (also called resiliency or
replication). Task replication is applied as a protection against task failures. In particular,
tasks can be replicated as flat or hierarchical groups, as exemplified in Figure
1.15. In flat groups (see Figure 1.15a), all tasks are identical in the sense that they
all carry out the same work. Eventually, only the result of one task is considered and
the other results are discarded. Obviously, flat groups are symmetrical and preclude
single points of failure (SPOFs). In particular, if one task crashes, the application
stays in business, though the group becomes smaller until it is recovered. However, if for
some applications a decision is to be made (e.g., acquiring a lock), a voting mechanism
might be required. As discussed earlier, voting mechanisms incur implementation
complexity, communication delays, and performance overheads.
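
As a rough illustration of a flat group, the sketch below (a hypothetical helper; the thread-based replicas and the plurality vote are assumptions made for the example) runs the same task on several replicas, tolerates individual crashes, and votes on the result:

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    # Minimal sketch of a flat group: every replica performs the same
    # work; crashed replicas merely shrink the group, and a vote over
    # the surviving results picks the answer that is kept.
    def replicated_call(task, n_replicas=3):
        with ThreadPoolExecutor(max_workers=n_replicas) as pool:
            futures = [pool.submit(task) for _ in range(n_replicas)]
        results = []
        for f in futures:
            try:
                results.append(f.result())
            except Exception:           # a failed replica is discarded
                pass
        if not results:
            raise RuntimeError("all replicas failed")
        value, _votes = Counter(results).most_common(1)[0]
        return value                    # plurality (majority) answer

    print(replicated_call(lambda: 42))  # 42, even if a replica crashed

Note the extra machinery the vote brings; this is exactly the implementation complexity and overhead referred to above.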
A hierarchical group (see Figure 1.15b) usually employs a coordinator task and
designates the rest of the tasks as workers. In this model, when a user request is made,
it is first forwarded to the coordinator, which in turn decides which worker is
best suited to fulfill it. Clearly, hierarchical groups exhibit the opposite properties of
flat ones. In particular, the coordinator is an SPOF and a potential performance bottleneck.
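
To contrast with the flat-group sketch, here is an equally hypothetical illustration of a hierarchical group (the class names and the least-loaded dispatch policy are assumptions for the example): all requests pass through a single coordinator, which is precisely why it is an SPOF:

    # Minimal sketch of a hierarchical group: one coordinator task
    # forwards each user request to the worker it deems best suited
    # (here, simply the least-loaded one).
    class Worker:
        def __init__(self, name):
            self.name, self.load = name, 0
        def run(self, request):
            self.load += 1
            return f"{self.name} handled {request}"

    class Coordinator:                  # the group's SPOF
        def __init__(self, workers):
            self.workers = workers
        def handle(self, request):
            worker = min(self.workers, key=lambda w: w.load)
            return worker.run(request)

    group = Coordinator([Worker("w1"), Worker("w2")])
    print(group.handle("req-1"))        # w1 handled req-1
    print(group.handle("req-2"))        # w2 handled req-2

If the coordinator were lost, no request could reach any worker, whereas in a flat group the surviving replicas would carry on.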