6.3 Resiliency through Spare Capacity
The general strategy used to gain resiliency is to have redundant units of capacity that can
fail independently of each other. Failures are detected and those units are removed from
service. The total capacity of the system is reduced but the system is still able to run. This
means that systems must be built with spare capacity to begin with.
Let's use the example of a web server that serves the static images displayed on a web
site. Such a server is easy to replicate because the content does not change frequently. We
can, for example, build multiple such servers and load balance between them. (How load
balancers work was discussed in Section 4.2.1.) We call these servers replicas because the
same service is replicated by each server. They are duplicates in that they all respond to the
same queries and give equivalent results. In this case the same images are accessed at the
same URLs.
Suppose each replica can handle 100 QPS and the service receives 300 QPS at peak
times. Three servers would be required to provide the 300 QPS capacity. An additional
replica is needed to provide the spare capacity required to survive one failed replica. Failure of
any one replica is detected and that replica is taken out of service automatically. The load is
now balanced over the surviving three replicas. The total capacity of the system is reduced
to 300 QPS, which is sufficient.
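The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `replicas_needed` and its parameters are hypothetical, and it assumes every replica has the same capacity and that load is balanced evenly.

```python
import math


def replicas_needed(peak_qps: float, per_replica_qps: float, spare: int = 1) -> int:
    """Return the number of replicas to deploy: enough for peak load, plus spares.

    Hypothetical helper; assumes identical replicas and even load balancing.
    """
    if per_replica_qps <= 0:
        raise ValueError("per_replica_qps must be positive")
    # Replicas required just to carry the peak load (round up partial replicas).
    required = math.ceil(peak_qps / per_replica_qps)
    # Add the replicas held in reserve to survive failures.
    return required + spare


# The example from the text: 300 QPS at peak, 100 QPS per replica, one spare.
print(replicas_needed(300, 100, spare=1))  # 4
```

With one replica out of service, the three survivors still provide 3 x 100 = 300 QPS, which matches the peak load.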
We call this N + M redundancy. Such systems require N units to provide capacity and
have M units of extra capacity. A unit is the smallest discrete system that provides the
service. The term N + 1 redundancy is used when we wish to indicate that there is enough
spare capacity for one failure, such as in our example. If we added a fifth server, the
system would be able to survive two simultaneous failures and would be described as N + 2
redundancy.
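As a small illustration of the notation, the value of M can be derived from the number of units deployed and the number required. The helper name `redundancy_label` is an assumption made for this sketch, not a standard term or API.

```python
def redundancy_label(deployed: int, required: int) -> str:
    """Describe a deployment in N + M terms (hypothetical helper).

    deployed: units currently in service.
    required: N, the units needed to carry the load.
    """
    if deployed < required:
        raise ValueError("fewer units deployed than required; no redundancy to describe")
    spare = deployed - required  # M: units of extra capacity
    return f"N + {spare}"


print(redundancy_label(4, 3))  # N + 1
print(redundancy_label(5, 3))  # N + 2
```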
What if we had 3 + 1 redundancy and a series of failures? After the first failure, the
system is described as 3 + 0. It is still running but there is no redundancy. The second failure
(a double failure) would result in the system being oversubscribed. That is, there is less
capacity available than needed.
Continuing our previous example, when there are two failed replicas, there is 200 QPS
of capacity. The system is now 3:2 oversubscribed: two replicas exist where three are
needed. If we are lucky, this has happened at a time of day that does not draw many users
and 200 QPS is sufficient. However, if we are unlucky, this has happened at peak usage
time and our two remaining servers are faced with 300 QPS, more than they are designed
to handle. Dealing with such an overload is covered later in Section 6.7.1.
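The progression from 3 + 1 to 3 + 0 to oversubscribed can be traced with a short sketch. The function `state_after_failures` and its parameter names are hypothetical, and the sketch assumes identical replicas that fail one at a time and are not repaired in between.

```python
import math


def state_after_failures(deployed: int, failed: int,
                         per_replica_qps: int, peak_qps: int) -> tuple[int, str]:
    """Return (remaining capacity in QPS, description) after some replicas fail.

    Hypothetical helper; assumes identical replicas and no repairs.
    """
    if not 0 <= failed <= deployed:
        raise ValueError("failed must be between 0 and deployed")
    surviving = deployed - failed
    required = math.ceil(peak_qps / per_replica_qps)  # N for the peak load
    capacity = surviving * per_replica_qps
    if surviving >= required:
        # Still enough units: report the remaining spare units.
        description = f"{required} + {surviving - required}"
    else:
        # Fewer units than needed: report needed:available.
        description = f"{required}:{surviving} oversubscribed"
    return capacity, description


# Four replicas of 100 QPS each, serving a 300 QPS peak.
for failed in range(3):
    print(failed, state_after_failures(4, failed, 100, 300))
# 0 (400, '3 + 1')
# 1 (300, '3 + 0')
# 2 (200, '3:2 oversubscribed')
```

Whether the oversubscribed state actually causes an overload depends on the load at that moment: 200 QPS of capacity is enough off-peak but not at the 300 QPS peak.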