Capacity Planning - The Practice of Cloud System Administration

18.1.3 Planned Growth

The second step is estimating additional growth due to marketing and business events, such

as new product launches or new features. For example, the marketing department may be

planning a major campaign in May that it predicts will increase the customer base by 20

to 25 percent. Or perhaps a new product is scheduled to launch in August that relies on

three existing services and is expected to increase the load on each of those by 10 percent

at launch, increasing to 30 percent by the end of the year. Use the data from any changes

detected in the first step to validate the assumptions about expected growth.
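
The arithmetic behind these two examples can be sketched as follows. This is an illustration only: the baseline of 1,000 requests per second is a made-up figure, the sketch assumes load grows in proportion to the customer base, and the linear ramp from 10 to 30 percent between August and December is an assumption, since the text gives only the two endpoints.

```python
def apply_growth(baseline, growth_fraction):
    """Return the load after a fractional increase (0.20 means +20 percent)."""
    return baseline * (1.0 + growth_fraction)


def ramp(start, end, steps):
    """Linearly interpolate a growth fraction from start to end over `steps` points."""
    if steps < 2:
        return [end]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


# Hypothetical baseline: 1,000 requests per second on one existing service.
baseline_rps = 1000.0

# Marketing campaign: customer base predicted to grow by 20 to 25 percent.
# Assumption: load scales in proportion to the customer base.
campaign_low = apply_growth(baseline_rps, 0.20)
campaign_high = apply_growth(baseline_rps, 0.25)

# Product launch: +10 percent at launch in August, +30 percent by year end.
# Assumption: the increase ramps linearly month by month (Aug through Dec).
launch_fractions = ramp(0.10, 0.30, 5)
launch_loads = [apply_growth(baseline_rps, f) for f in launch_fractions]
```

Planning against the high end of the campaign range (`campaign_high`) is the conservative choice; the low end is useful mainly for later checking the marketing prediction against measured data.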

18.1.4 Headroom

Headroom is the amount of excess capacity that is considered routine. Any service will

have usage spikes or edge conditions that require extended resource usage occasionally. To

prevent these edge conditions from triggering outages, spare resources must be routinely

available. How much headroom is needed for any given service is a business decision.

Since excess capacity is largely unused capacity, by its very nature it represents potentially

wasted investment. Thus a financially responsible company wants to balance the potential

for service interruption with the desire to conserve financial resources.

Your monitoring data should be picking up these resource spikes and providing hard

statistical data on when, where, and how often they occur. Data on outages and postmortem

reports are also key in determining reasonable headroom.
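
As a rough illustration of the kind of statistic monitoring data can supply, the sketch below counts how often utilization samples exceed a threshold. The function name `spike_stats`, the sample values, and the 75 percent threshold are all hypothetical; a real monitoring system would provide far more data and its own query tools.

```python
def spike_stats(samples, threshold):
    """Summarize how often and how far utilization exceeds a threshold."""
    over = [s for s in samples if s > threshold]
    return {
        "count": len(over),                                    # number of spikes
        "fraction": len(over) / len(samples) if samples else 0.0,
        "peak": max(samples) if samples else None,             # worst case seen
    }


# Hypothetical hourly CPU utilization readings, in percent.
samples = [42, 45, 51, 48, 88, 93, 47, 44, 50, 79, 46, 43]
stats = spike_stats(samples, threshold=75)
```

The peak value suggests how much headroom a spike actually consumed, and the fraction indicates how routine such spikes are.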

Another component in determining how much headroom is needed is the amount of

time it takes to have additional resources deployed into production from the moment that

someone realizes that additional resources are required. If it takes three months to make

new resources available, then you need to have more headroom available than if it takes

two weeks or one month. At a minimum, you need sufficient headroom to allow for the

expected growth during that time period.
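
The minimum described above can be estimated with a short calculation. This sketch assumes compound month-over-month growth, which the text does not specify, and uses hypothetical figures (1,000 units of current load, 5 percent monthly growth).

```python
def min_headroom(current_load, monthly_growth, lead_time_months):
    """Capacity needed to absorb expected growth during the deployment lead time.

    Assumes compound growth at `monthly_growth` per month (0.05 means 5 percent).
    """
    return current_load * ((1.0 + monthly_growth) ** lead_time_months - 1.0)


current_load = 1000.0   # hypothetical current peak load
growth = 0.05           # hypothetical 5 percent growth per month

three_months = min_headroom(current_load, growth, 3)
one_month = min_headroom(current_load, growth, 1)
two_weeks = min_headroom(current_load, growth, 0.5)  # roughly half a month
```

This figure is only a floor: it covers expected growth while new resources are being deployed, and the allowance for spikes and edge conditions comes on top of it.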

18.1.5 Resiliency

Reliable services also need additional capacity to meet their SLAs. The additional capacity

allows for some components to fail, without the end users experiencing an outage or service

degradation. As discussed in Chapter 6, the additional capacity needs to be in a different

failure domain; otherwise, a single outage could take down both the primary machines and

the spare capacity that should be available to take over the load.
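
One way to check this requirement is to ask whether the remaining failure domains can carry peak load when any single domain is lost. The sketch below is a simplified model: the domain names and capacity figures are hypothetical, and it assumes load can be redistributed freely across the surviving domains.

```python
def survives_single_domain_failure(capacity_by_domain, peak_load):
    """Return True if losing any one failure domain still leaves enough capacity."""
    total = sum(capacity_by_domain.values())
    return all(total - cap >= peak_load for cap in capacity_by_domain.values())


# Hypothetical layout: capacity spread evenly over three failure domains.
balanced = {"domain-a": 600, "domain-b": 600, "domain-c": 600}

# Hypothetical layout: most capacity concentrated in a single failure domain.
lopsided = {"domain-a": 1200, "domain-b": 300}

balanced_ok = survives_single_domain_failure(balanced, peak_load=1000)
lopsided_ok = survives_single_domain_failure(lopsided, peak_load=1000)
```

The lopsided layout has more than enough total capacity for the peak load, yet it fails the check because the spare capacity sits in the same domain as most of the primary capacity.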

Failure domains also should be considered at a large scale, typically at the datacenter

level. For example, facility-wide maintenance work on the power systems requires the entire

building to be shut down. If an entire datacenter is offline, the service must be able to

smoothly run from the other datacenters with no capacity problems. Spreading the service
