Design Patterns for Resiliency - The Practice of Cloud System Administration


being awakened in the middle of the night because a machine is down, we are alerted only

if the needle of a gauge gets near the danger zone.

6.3.2 Load Sharing versus Hot Spares

In the previous examples, the replicas are load sharing: all are active, are sharing the

workload equally (approximately), and have equal amounts of spare capacity (approximately).

Another strategy is to have primary and secondary replicas. In this approach, the primary

replica receives the entire workload but the secondary replica is ready to take over at any

time. This is sometimes called the hot spare or “hot standby” strategy since the spare is

connected to the system, running (hot), and can be switched into operation instantly. It is

also known as an active-passive or master-slave pair. Often there are multiple secondaries.

Because there is only one master, these configurations are 1 + M configurations.

Sometimes the term “active-active” or “master-master” pair will be used to refer to two

replicas that are load sharing. “Active-active” is more commonly used with network links.

“Master-master” is more commonly used in the database world and in situations where the

two are tightly coupled.
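
The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under assumed names: the replica names and the functions load_sharing_router and hot_spare_target are hypothetical, and simple round-robin stands in for whatever load-balancing policy a real system uses.

```python
from itertools import cycle


def load_sharing_router(replicas):
    """Return a function that hands out replicas in turn (round-robin)."""
    ring = cycle(replicas)
    return lambda: next(ring)


def hot_spare_target(replicas, healthy):
    """Return the first healthy replica; replicas[0] is the primary."""
    for name in replicas:
        if name in healthy:
            return name
    raise RuntimeError("no healthy replica available")


# Hypothetical replica names, used only for illustration.
replicas = ["replica-a", "replica-b", "replica-c"]

# Load sharing: every replica is active and takes a share of the work.
pick = load_sharing_router(replicas)
shared = [pick() for _ in range(6)]

# Hot spare (1 + M): the primary receives everything until it fails,
# then the first healthy secondary takes over.
normal = hot_spare_target(replicas, healthy={"replica-a", "replica-b", "replica-c"})
failover = hot_spare_target(replicas, healthy={"replica-b", "replica-c"})
```

In the load-sharing case each replica serves two of the six requests. In the hot-spare case all traffic goes to replica-a until it is marked unhealthy, at which point replica-b takes over.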

6.4 Failure Domains

A failure domain is the bounded area beyond which failure has no impact. For example,

when a car fails on a highway, its failure does not make the entire highway unusable. The

impact of the failure is bounded to its failure domain.

The failure domain of a fuse in a home circuit breaker box is the room or two that is

covered by that circuit. If a power line is cut, the failure domain affects a number of houses

or perhaps a city block. The failure domain of a power grid might be the town, region, or

county that it feeds (which is why some datacenters are located strategically so they have

access to two power grids).

A failure domain may be prescriptive—that is, a design goal or requirement. You might

plan that two groups of servers are each their own failure domain and then engineer the

system to meet that goal, assuring that the failure domains that they themselves rely on

are independent. Each group may be in different racks, different power circuits, and so on.

Whether they should be in different datacenters depends on the scope of the failure domain

goal.
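
A prescriptive goal of this kind can be checked mechanically if each server's dependencies are recorded. The following is a minimal sketch, assuming a hypothetical inventory that maps each server to the racks and power circuits it relies on; the function name shared_dependencies and all server, rack, and circuit names are invented for illustration.

```python
def shared_dependencies(group_a, group_b):
    """Return the infrastructure components that both groups rely on."""
    deps_a = set().union(*group_a.values())
    deps_b = set().union(*group_b.values())
    return deps_a & deps_b


# Hypothetical inventory: server name -> components it depends on.
group_a = {
    "web1": {"rack-1", "circuit-A"},
    "web2": {"rack-1", "circuit-A"},
}
group_b = {
    "web3": {"rack-2", "circuit-B"},
    "web4": {"rack-2", "circuit-A"},  # mistakenly wired to circuit A
}

# A non-empty result means the groups are not independent failure domains.
overlap = shared_dependencies(group_a, group_b)

# After moving web4 to circuit B, the two groups share nothing.
group_b["web4"] = {"rack-2", "circuit-B"}
overlap_after_fix = shared_dependencies(group_a, group_b)
```

Here the first check reports circuit-A as a shared dependency, so a failure of that circuit would affect both groups. The second check returns an empty set, which is what the design goal requires.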

Alternatively, a failure domain may be descriptive. Often we find ourselves exploring

a system trying to determine, or reverse-engineer, what the resulting failure domain has

become. Due to a failed machine, a server may have been moved temporarily to a spare

machine in another rack. We can determine the new failure domain by exploring the

implications of this move.
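
Exploring the implications of such a move amounts to asking which servers are affected when a given component fails. The sketch below assumes a hypothetical placement map from servers to the components they depend on; the function failure_domain and every name in the map are illustrative, not taken from any real tool.

```python
def failure_domain(placement, component):
    """Return the set of servers affected if the given component fails."""
    return {server for server, deps in placement.items() if component in deps}


# Hypothetical placement: server name -> components it depends on.
placement = {
    "db1": {"rack-1", "circuit-A"},
    "web1": {"rack-2", "circuit-B"},
    "web2": {"rack-2", "circuit-B"},
}

before = failure_domain(placement, "rack-2")

# db1's machine failed, so it was moved temporarily to a spare in rack-2.
placement["db1"] = {"rack-2", "circuit-B"}

after = failure_domain(placement, "rack-2")
```

Before the move, a rack-2 failure affects only web1 and web2. After the move it also takes down db1, so the failure domain of rack-2 has quietly grown to include the database.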
