18.1.3 Planned Growth
The second step is estimating additional growth due to marketing and business events, such
as new product launches or new features. For example, the marketing department may be
planning a major campaign in May that it predicts will increase the customer base by 20
to 25 percent. Or perhaps a new product is scheduled to launch in August that relies on
three existing services and is expected to increase the load on each of those by 10 percent
at launch, increasing to 30 percent by the end of the year. Use the data from any changes
detected in the first step to validate the assumptions about expected growth.
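The arithmetic behind such an estimate can be sketched in a few lines. This is only an illustration: the baseline figure and the function name `apply_growth` are hypothetical, and the sketch assumes that customer growth translates proportionally into load and that successive events compound, neither of which is guaranteed for a real service.

```python
def apply_growth(baseline, fractions):
    """Apply each planned growth fraction (e.g. 0.20 for 20 percent) in turn.

    Assumes the events compound multiplicatively, which is a simplification.
    """
    load = baseline
    for fraction in fractions:
        load *= 1 + fraction
    return load


# Hypothetical baseline load, in requests per second.
baseline = 1000.0

# Low case at product launch: 20 percent campaign growth, then 10 percent from the launch.
at_launch_low = apply_growth(baseline, [0.20, 0.10])

# High case at year end: 25 percent campaign growth, then 30 percent from the launch.
year_end_high = apply_growth(baseline, [0.25, 0.30])

print(round(at_launch_low), round(year_end_high))
```

Comparing these projections against the changes actually observed in the monitoring data is one way to check whether the marketing assumptions hold.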
18.1.4 Headroom
Headroom is the amount of excess capacity that is considered routine. Any service will
have usage spikes or edge conditions that require extended resource usage occasionally. To
prevent these edge conditions from triggering outages, spare resources must be routinely
available. How much headroom is needed for any given service is a business decision.
Since excess capacity is largely unused capacity, by its very nature it represents potentially
wasted investment. Thus a financially responsible company wants to balance the potential
for service interruption with the desire to conserve financial resources.
Your monitoring data should be picking up these resource spikes and providing hard
statistical data on when, where, and how often they occur. Data on outages and postmortem
reports are also key in determining reasonable headroom.
Another component in determining how much headroom is needed is the amount of
time it takes to have additional resources deployed into production from the moment that
someone realizes that additional resources are required. If it takes three months to make
new resources available, then you need to have more headroom available than if it takes
two weeks or one month. At a minimum, you need sufficient headroom to allow for the
expected growth during that time period.
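As a rough sketch of that minimum, the following computes how much capacity the expected growth would consume during the deployment lead time. The monthly growth rate, the current usage figure, and the name `min_headroom` are hypothetical, and steady compound growth is assumed; real headroom would also need to cover spikes and edge conditions.

```python
def min_headroom(current_usage, monthly_growth, lead_time_months):
    """Capacity consumed by expected growth while new resources are being deployed.

    Assumes steady compound growth at `monthly_growth` (e.g. 0.05 for 5 percent).
    """
    return current_usage * ((1 + monthly_growth) ** lead_time_months - 1)


# Hypothetical service: 1000 units in use, growing 5 percent per month.
three_months = min_headroom(1000.0, 0.05, 3)
two_weeks = min_headroom(1000.0, 0.05, 0.5)

print(f"3-month lead time: {three_months:.1f} units")
print(f"2-week lead time:  {two_weeks:.1f} units")
```

Under these assumptions, the longer lead time calls for several times as much headroom just to absorb growth, before any allowance for usage spikes.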
18.1.5 Resiliency
Reliable services also need additional capacity to meet their SLAs. The additional capacity
allows for some components to fail without the end users experiencing an outage or service
degradation. As discussed in Chapter 6, the additional capacity needs to be in a different
failure domain; otherwise, a single outage could take down both the primary machines and
the spare capacity that should be available to take over the load.
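One way to check this requirement is to confirm that the peak load still fits after removing any single failure domain. The sketch below is illustrative only: the domain names, capacity figures, and the function `survives_single_failure` are hypothetical, and it assumes load can be redistributed freely across the surviving domains.

```python
def survives_single_failure(capacity_by_domain, peak_load):
    """Return True if losing any one failure domain still leaves enough capacity.

    Assumes load can be moved freely to the surviving domains.
    """
    total = sum(capacity_by_domain.values())
    return all(total - capacity >= peak_load
               for capacity in capacity_by_domain.values())


# Hypothetical layouts, both with 1800 units of total capacity.
spread_evenly = {"domain-a": 600, "domain-b": 600, "domain-c": 600}
concentrated = {"domain-a": 1200, "domain-b": 600}

even_ok = survives_single_failure(spread_evenly, peak_load=1000)
concentrated_ok = survives_single_failure(concentrated, peak_load=1000)

print(even_ok, concentrated_ok)
```

Both layouts have the same total capacity, but in the second one the spare capacity shares a failure domain with most of the primary capacity, so a single outage leaves too little to carry the load.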
Failure domains also should be considered at a large scale, typically at the datacenter
level. For example, facility-wide maintenance work on the power systems requires the entire
building to be shut down. If an entire datacenter is offline, the service must be able to
smoothly run from the other datacenters with no capacity problems. Spreading the service