6.3.1 How Much Spare Capacity
Spare capacity is like an insurance policy: it is an expense you pay now to prepare for
future trouble that you hope does not happen. It is better to have insurance and not need it
than to need insurance and not have it. That said, paying for too much insurance is wasteful
and not good business. Selecting the granularity of our unit of capacity enables us to
manage the efficiency. For example, in a 1 + 1 redundant system, 50 percent of the capacity is
spare. In a 20 + 1 redundant system, less than 5 percent of the capacity is spare. The latter
is more cost-efficient.
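To make the efficiency comparison concrete, here is a minimal sketch in Python (the figures are the ones above; the function name is our own for illustration) that computes the spare fraction of an N + M redundant system:

def spare_fraction(n: int, m: int = 1) -> float:
    """Fraction of total capacity that is spare in an N + M system."""
    return m / (n + m)

# 1 + 1 redundancy: half of the total capacity is spare.
print(f"1 + 1:  {spare_fraction(1):.0%} spare")   # 50%

# 20 + 1 redundancy: less than 5 percent of the capacity is spare.
print(f"20 + 1: {spare_fraction(20):.1%} spare")  # 4.8%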
The other factors in selecting the amount of redundancy are how quickly we can bring
up additional capacity and how likely it is that a second failure will happen during that
time. The time it takes to repair or replace the down capacity is called the mean time to
repair (MTTR). The failure rate is the reciprocal of the mean time between failures
(MTBF), so the probability that a second failure will happen during the repair window is
approximately MTTR/MTBF, or MTTR/MTBF × 100 as a percentage.
If a second failure means data loss, the probability of a second failure becomes an
important factor in how many spares you should have.
Suppose it takes a week (168 hours) to repair the capacity and the MTBF is 10,000
hours. There is a 168/10,000 × 100 = 1.7 percent, or 1 in 60, chance of a second failure.
Now suppose the MTBF is two weeks (336 hours). In this case, there is a 168/336 ×
100 = 50 percent, or 1 in 2, chance of a second failure, the same as a coin flip. Adding
another replica becomes prudent.
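The arithmetic is easy to capture in a short sketch (Python; the helper name and the cap at 100 percent are our assumptions, the figures are the ones above):

def second_failure_probability(mttr_hours: float, mtbf_hours: float) -> float:
    """Approximate probability of a second failure during the repair window.

    The MTTR/MTBF approximation holds when MTTR is small relative to MTBF;
    we cap it at 1.0, where a second failure is all but certain.
    """
    return min(mttr_hours / mtbf_hours, 1.0)

# One-week repair (168 hours) against a 10,000-hour MTBF: about 1 in 60.
print(f"{second_failure_probability(168, 10_000):.1%}")  # 1.7%

# Same repair window against a two-week (336-hour) MTBF: a coin flip.
print(f"{second_failure_probability(168, 336):.0%}")     # 50%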
MTTR is a function of a number of factors. A process that dies and needs to be restarted
has a very short MTTR. A broken hardware component may take only a few minutes to
replace, but if that server is in a datacenter 9,000 miles away, it may take a month before
someone is able to reach it. Spare parts need to be ordered, shipped, and delivered. Even if
a disk can be replaced within minutes of failure, if it is in a RAID configuration there may
be a long, slow rebuild time where the system is still N + 0 until the rebuild is complete.
If all this math makes your head spin, here is a simple rule of thumb: N + 1 is a minimum
for a service; N + 2 is needed if a second outage is likely while you are fixing the first one.
Digital computers are either on or off, and we tend to think of a service the same way:
it is either running or not, up or down. When we use resiliency through replication, the
service is more like an analog device: it can be on, off, or anywhere in between. We are no
longer monitoring the service to determine if it is up or down. Instead, we are monitoring
the amount of capacity in the system and determining whether we should be adding more.
This changes the way we think about our systems and how we do operations. Rather than