6.3.1 How Much Spare Capacity
Spare capacity is like an insurance policy: it is an expense you pay now to prepare for
future trouble that you hope does not happen. It is better to have insurance and not need it
than to need insurance and not have it. That said, paying for too much insurance is wasteful
and not good business. Selecting the granularity of our unit of capacity enables us to
manage the efficiency. For example, in a 1 + 1 redundant system, 50 percent of the capacity is
spare. In a 20 + 1 redundant system, less than 5 percent of the capacity is spare. The latter
is more cost-efficient.
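To make the efficiency comparison concrete, here is a minimal sketch in Python (the figures are the ones above; the function name is our own for illustration) that computes the spare fraction of an N + M redundant system:

def spare_fraction(n: int, m: int = 1) -> float:
    """Fraction of total capacity that is spare in an N + M system."""
    return m / (n + m)

# 1 + 1 redundancy: half of the total capacity is spare.
print(f"1 + 1:  {spare_fraction(1):.0%} spare")   # 50%

# 20 + 1 redundancy: less than 5 percent of the capacity is spare.
print(f"20 + 1: {spare_fraction(20):.1%} spare")  # 4.8%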
The other factors in selecting the amount of redundancy are how quickly we can bring
up additional capacity and how likely it is that a second failure will happen during that
time. The time it takes to repair or replace the down capacity is called the mean time to
repair (MTTR). The failure rate is the reciprocal of the mean time between failures
(MTBF), so the probability that a second failure will happen during the repair window is
approximately MTTR/MTBF, or MTTR/MTBF × 100 as a percentage.
If a second failure means data loss, the probability of a second failure becomes an
important factor in how many spares you should have.
Suppose it takes a week (168 hours) to repair the capacity and the MTBF is 10,000
hours. There is a 168/10,000 × 100 = 1.7 percent, or 1 in 60, chance of a second failure.
Now suppose the MTBF is two weeks (336 hours). In this case, there is a 168/336 ×
100 = 50 percent, or 1 in 2, chance of a second failure, the same as a coin flip. Adding
another replica becomes prudent.
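The arithmetic is easy to capture in a short sketch (Python; the helper name and the cap at 100 percent are our assumptions, the figures are the ones above):

def second_failure_probability(mttr_hours: float, mtbf_hours: float) -> float:
    """Approximate probability of a second failure during the repair window.

    The MTTR/MTBF approximation holds when MTTR is small relative to MTBF;
    we cap it at 1.0, where a second failure is all but certain.
    """
    return min(mttr_hours / mtbf_hours, 1.0)

# One-week repair (168 hours) against a 10,000-hour MTBF: about 1 in 60.
print(f"{second_failure_probability(168, 10_000):.1%}")  # 1.7%

# Same repair window against a two-week (336-hour) MTBF: a coin flip.
print(f"{second_failure_probability(168, 336):.0%}")     # 50%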
MTTR is a function of a number of factors. A process that dies and needs to be restarted
has a very short MTTR. A broken hardware component may take only a few minutes to
replace, but if that server is in a datacenter 9,000 miles away, it may take a month before
someone is able to reach it. Spare parts need to be ordered, shipped, and delivered. Even if
a disk can be replaced within minutes of failure, if it is in a RAID configuration there may
be a long, slow rebuild time where the system is still N + 0 until the rebuild is complete.
If all this math makes your head spin, here is a simple rule of thumb: N + 1 is a minimum
for a service; N + 2 is needed if a second outage is likely while you are fixing the first one.
Digital computers are either on or off, and we tend to think of a service the same way:
it is either running or not, up or down. When we use resiliency through replication, the
service is more like an analog device: it can be on, off, or anywhere in between. We are no
longer monitoring the service to determine if it is up or down. Instead, we are monitoring
the amount of capacity in the system and determining whether we should be adding more.
This changes the way we think about our systems and how we do operations. Rather than