Design Patterns for Resiliency - The Practice of Cloud System Administration


The primary strategy for dealing with this problem in user-facing services is graceful

degradation. This topic was covered in Section 2.1.10.

Dynamic Resource Allocation

Another strategy is to add capacity dynamically. With this approach, a system would

detect that a service is becoming overloaded and allocate an unused machine from a pool

of idle machines that are running but otherwise unconfigured. An automated system would

configure the machine and use it to add capacity to the overloaded service, thereby

resolving the issue.

It can be costly to have idle capacity but this cost can be mitigated by using a shared

pool. That is, one pool of idle machines serves a group of services. The first service to

become overloaded allocates the machines. If the pool is large enough, more than one service

can become overloaded at the same time. There should also be a mechanism for services to

give back machines when the need disappears.
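
The shared pool described above can be sketched in a few lines of Python. This is a minimal illustration, not a production allocator: the names SharedPool and rebalance, and the load thresholds of 0.8 and 0.3, are assumptions made for the example, and the step of actually configuring a machine is left out.

```python
class SharedPool:
    """One pool of idle machines serving a group of services (hypothetical model)."""

    def __init__(self, machines):
        self.idle = list(machines)   # running but otherwise unconfigured machines
        self.allocated = {}          # service name -> machines currently on loan

    def allocate(self, service):
        """Lend one idle machine to a service; return None if the pool is empty."""
        if not self.idle:
            return None
        machine = self.idle.pop()
        # A real system would configure the machine for the service here.
        self.allocated.setdefault(service, []).append(machine)
        return machine

    def release(self, service):
        """Give one borrowed machine back; return None if nothing is on loan."""
        borrowed = self.allocated.get(service, [])
        if not borrowed:
            return None
        machine = borrowed.pop()
        self.idle.append(machine)
        return machine


def rebalance(pool, service, load, high=0.8, low=0.3):
    """Add capacity above the `high` load mark, return it below the `low` mark.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if load > high and pool.allocate(service) is not None:
        return "grew"
    if load < low and pool.release(service) is not None:
        return "shrank"
    return "unchanged"
```

Keeping the two thresholds apart gives the service some hysteresis, so a load hovering near a single cutoff does not cause machines to be allocated and returned repeatedly.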

Additional capacity can be found at other service providers as well. A public cloud

computing provider can be used as the shared pool. Usually you will not have to pay for unused

capacity.

Shared resource pools are not just appropriate for machines, but may also be used for

storage and other resources.

Load Shedding

Another strategy is load shedding. With this strategy the service turns away some users so

that other users can have a good experience.

To make an analogy, an overloaded phone system doesn't suddenly disconnect all existing

calls. Instead, it responds to any new attempts to make a call with a “fast busy” tone so

that the person will try to make the call later. An overloaded web site should likewise give

some users an immediate response, such as a simple “come back later” web page, rather

than requiring them to time out after minutes of waiting.
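
One way to picture the “fast busy” behavior is an admission check in front of the request handler. The sketch below is an assumption-laden illustration: the class LoadShedder, the function handle_request, the fixed in-flight limit, and the use of an HTTP-style 503 status are all choices made for the example rather than anything prescribed by the text.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self):
        """Return True and count the request if there is room, else False."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def done(self):
        """Mark one admitted request as finished."""
        with self._lock:
            self.in_flight -= 1


def handle_request(shedder, work):
    """Run `work` if admitted; otherwise answer immediately with a busy reply."""
    if not shedder.try_admit():
        # New arrivals are turned away at once; requests already in
        # progress are left alone, like existing calls on the phone system.
        return 503, "Service busy. Please come back later."
    try:
        return 200, work()
    finally:
        shedder.done()
```

The rejected user gets an answer immediately instead of waiting for a timeout, and the requests already admitted keep the capacity they need to finish.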

A variation of load shedding is stopping certain tasks that can be put off until later. For

example, low-priority database updates could be queued up for processing later; a social

network that stores reputation points for users might store the fact that points have been

awarded rather than processing them; nightly bulk file transfers might be delayed if the

network is overloaded.
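
Putting off low-priority work can be sketched as a queue that records tasks while the system is overloaded and runs them once load subsides. The class DeferredTasks and the `overloaded` flag are hypothetical; how overload is detected is outside the scope of this sketch.

```python
from collections import deque


class DeferredTasks:
    """Queue low-priority tasks during overload and run them later (hypothetical)."""

    def __init__(self):
        self.pending = deque()   # tasks recorded but not yet processed

    def submit(self, task, overloaded):
        """Run `task` now and return True, or queue it and return False."""
        if overloaded:
            self.pending.append(task)
            return False
        task()
        return True

    def drain(self, max_tasks=None):
        """Run queued tasks in arrival order; return how many were run.

        `max_tasks` optionally caps the batch so that catching up does
        not itself overload the system.
        """
        ran = 0
        while self.pending and (max_tasks is None or ran < max_tasks):
            self.pending.popleft()()
            ran += 1
        return ran
```

In the reputation-points example, submit would record the award while the site is busy, and a later call to drain would apply the points.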

That said, tasks that can be put off for a couple of hours might cause problems if they are

put off forever. There is, after all, a reason they exist. For any activity that is delayed due

to load shedding, there must be a plan for how such a delay is handled. Establish a service

level agreement (SLA) to determine how long something can be delayed and to identify a
