Design Patterns for Resiliency - The Practice of Cloud System Administration

6.3 Resiliency through Spare Capacity

The general strategy used to gain resiliency is to have redundant units of capacity that can

fail independently of each other. Failures are detected and those units are removed from

service. The total capacity of the system is reduced but the system is still able to run. This

means that systems must be built with spare capacity to begin with.
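
The following is a minimal Python sketch of this strategy, not an implementation taken from the text. The class name ReplicaPool, its methods, and the round-robin selection policy are assumptions made for illustration. Failure detection itself (for example, health checks) is omitted; the sketch only shows a failed unit being removed while the remaining units keep serving.

```python
class ReplicaPool:
    """Hypothetical pool that routes requests only to units still in service."""

    def __init__(self, names):
        self.healthy = list(names)
        self._next = 0

    def mark_failed(self, name):
        # Remove a unit that has been detected as failed.
        # Total capacity shrinks, but the service keeps running.
        if name in self.healthy:
            self.healthy.remove(name)

    def pick(self):
        # Round-robin over the remaining units (an assumed policy).
        if not self.healthy:
            raise RuntimeError("no healthy units left")
        unit = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return unit


pool = ReplicaPool(["a", "b", "c", "d"])
pool.mark_failed("b")
picks = [pool.pick() for _ in range(6)]
print(picks)  # requests go only to a, c, and d
```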

Let's use the example of a web server that serves the static images displayed on a web

site. Such a server is easy to replicate because the content does not change frequently. We

can, for example, build multiple such servers and load balance between them. (How load

balancers work was discussed in Section 4.2.1.) We call these servers replicas because the

same service is replicated by each server. They are duplicates in that they all respond to the

same queries and give equivalent results. In this case the same images are accessed at the

same URLs.

Suppose each replica can handle 100 QPS and the service receives 300 QPS at peak

times. Three servers would be required to provide the 300 QPS capacity. An additional replica

is needed to provide the spare capacity required to survive one failed replica. Failure of

any one replica is detected and that replica is taken out of service automatically. The load is

now balanced over the surviving three replicas. The total capacity of the system is reduced

to 300 QPS, which is sufficient.
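
The arithmetic in this example can be sketched in Python. The function name replicas_needed and its parameters are hypothetical, and the sketch assumes that every replica has the same capacity.

```python
import math


def replicas_needed(peak_qps: float, qps_per_replica: float, spares: int = 1) -> int:
    """Return the replicas needed to carry peak load plus the spare replicas."""
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    # Round up: a partially used replica is still a whole replica.
    required = math.ceil(peak_qps / qps_per_replica)
    return required + spares


total = replicas_needed(peak_qps=300, qps_per_replica=100, spares=1)
print(total)  # 4

# Capacity left after one replica fails.
print((total - 1) * 100)  # 300
```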

We call this N + M redundancy. Such systems require N units to provide capacity and

have M units of extra capacity. A unit is the smallest discrete system that provides the service.

The term N + 1 redundancy is used when we wish to indicate that there is enough

spare capacity for one failure, such as in our example. If we added a fifth server, the system

would be able to survive two simultaneous failures and would be described as N + 2

redundancy.

What if we had 3 + 1 redundancy and a series of failures? After the first failure, the system

is described as 3 + 0. It is still running but there is no redundancy. The second failure

(a double failure) would result in the system being oversubscribed. That is, there is less

capacity available than needed.
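
One way to illustrate how the description of a system changes as failures accumulate is the short Python sketch below. The function redundancy_state is a hypothetical helper, and it assumes that all units are interchangeable.

```python
def redundancy_state(required: int, total_units: int, failed: int) -> str:
    """Describe a system as 'N + M', or as oversubscribed when too few units remain."""
    working = total_units - failed
    if working < 0:
        raise ValueError("more failures than units")
    spare = working - required
    if spare >= 0:
        return f"{required} + {spare}"
    # Fewer working units than the load requires.
    return "oversubscribed"


for failures in range(3):
    print(failures, redundancy_state(required=3, total_units=4, failed=failures))
# 0 3 + 1
# 1 3 + 0
# 2 oversubscribed
```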

Continuing our previous example, when there are two failed replicas, there is 200 QPS

of capacity. The system is now 3:2 oversubscribed: two replicas exist where three are

needed. If we are lucky, this has happened at a time of day that does not draw many users

and 200 QPS is sufficient. However, if we are unlucky, this has happened at peak usage

time and our two remaining servers are faced with 300 QPS, more than they are designed

to handle. Dealing with such an overload is covered later in Section 6.7.1.
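
To make the lucky and unlucky cases concrete, here is a small Python sketch. The helper overload_qps is hypothetical, and the off-peak figure of 150 QPS is an assumed value chosen only for illustration; the example above specifies only the 300 QPS peak.

```python
def overload_qps(working_replicas: int, qps_per_replica: int, offered_qps: int) -> int:
    """Return how many QPS exceed the surviving capacity (0 if the load fits)."""
    capacity = working_replicas * qps_per_replica
    return max(0, offered_qps - capacity)


# Two of the four replicas have failed, leaving 200 QPS of capacity.
print(overload_qps(2, 100, 150))  # 0   (assumed off-peak load fits)
print(overload_qps(2, 100, 300))  # 100 (peak load exceeds capacity)
```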
