Service Configuration and Coordination - Real-Time Analytics

must also somehow track the correctness of the current configuration or the

validity of its coordination efforts.

Managing these requirements in a distributed environment is a notoriously

difficult problem, often leading to incorrect server behavior.

Alternatively, if the problems are not addressed, they can lead to single

points of failure in the distributed system. For “offline” processing systems,

this is only a minor concern as the single point of failure can be

re-established manually. In real-time systems, this is more of a problem as

recovery introduces potentially unacceptable delays in processing or, more

commonly, missed processing entirely.

This leads directly to the motivation behind configuration and coordination

systems: providing a system-wide service that correctly and reliably

implements distributed configuration and coordination primitives.

These primitives, similar to the coordination primitives provided for

multithreaded development, are then used to implement distributed

versions of high-level algorithms.

Maintaining Distributed State

Writing concurrent code that shares state within a single application is hard.

Even with operating systems and development environments providing

support for concurrency primitives, the interaction between threads and

processes is still a common source of errors. The problems are further

compounded when the concurrency spans multiple machines. Now, in

addition to the usual problems of concurrency—deadlocks, race conditions,

and so on—there are a host of new problems to address.
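To make the single-machine case concrete, here is a minimal sketch in Python of threads sharing a counter. It is an illustration only, not code from the text; the function name run_counter and its parameters are hypothetical. The lock is the kind of concurrency primitive that operating systems and development environments provide.

```python
import threading


def run_counter(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # "counter += 1" is a read-modify-write sequence, not an atomic
            # step. Holding the lock keeps two threads from interleaving it.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter())
```

The lock works here because every thread lives in one process and can reach the same lock object. Once the workers run on separate machines there is no such shared object, which is the gap that distributed coordination primitives are meant to fill.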

Unreliable Network Connections

Even in the most well-controlled datacenter, networks are unreliable

relative to a single machine. Latency can vary widely from moment to

moment, bandwidth can change over time, and connections can be lost.

In a wide area network, a “Backhoe Event” can sever connections between

previously unified networks. For concurrent applications, this last event

(which can happen within a single datacenter) is the worst problem.
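One common way for a node to notice lost connections is a heartbeat timeout. The sketch below is a hedged illustration of that general idea, not a mechanism described in the text; the class HeartbeatMonitor, its method names, and the timeout value are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks when each peer was last heard from (hypothetical helper)."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, peer: str) -> None:
        """Note that a message from `peer` arrived just now."""
        self.last_seen[peer] = self.clock()

    def unreachable_peers(self) -> List[str]:
        """Return peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5.0)
    monitor.record_heartbeat("node-a")
    print(monitor.unreachable_peers())
```

A timeout only reports silence. From the observer's side, a peer that has crashed, a peer that is merely slow, and a peer on the far side of a severed link all look the same, which is part of why a lost connection is so troublesome for concurrent applications.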

In concurrent programming, the loss of connectivity between two groups

of systems is known as the “Split Brain Problem.” When this happens, a
