Preparing for Outages
Although preparing for outages is primarily a disaster planning and recovery concern, it is
also highly relevant to cost and capacity optimization. Public cloud outages can cause
significant revenue losses for a company and can even alienate an organization's main
customers. Companies should generally follow three key principles: prepare for failure,
design for failure, and have an alternative.
Engineers should have a good understanding of the system's weak points. They should
also be prepared for a real disaster, and one way to do that is to carry out service outage
drills. Cloud environments are composed of many machines, and machines fail. The system
must be designed to handle failure; although designing for failure can be expensive, it may
pay off in the end.
Furthermore, a few mission-critical services can be deployed and served out of alternative
data centers if necessary. This helps minimize the risk of a “total blackout” during an
outage of the particular public cloud data center housing an organization's applications.
One option is to deploy applications in multiple regions and in isolated locations within
those regions, such as by using Amazon's Regions and Availability Zones. The idea is to
spread the chances of failure over a number of locations, which reduces the probability of
a total blackout, because failures are usually isolated to specific locations and do not
spread outward to other regions. See the following location for information on Amazon's
Regions and Availability Zones:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
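As a rough illustration of this idea, the following sketch uses Python and boto3 to discover
the Availability Zones in a region and launch one instance in each of them. The region, AMI
ID, and instance type are placeholder assumptions, not values tied to any particular
deployment.

# Sketch: spread identical instances across every available Availability Zone
# in a region so a single-zone failure cannot take the whole application down.
# Assumes boto3 is installed and AWS credentials are configured; the AMI ID
# and instance type are placeholders.
import boto3

REGION = "us-east-1"               # region housing the application (assumption)
AMI_ID = "ami-0123456789abcdef0"   # placeholder AMI for the application server
INSTANCE_TYPE = "t3.micro"         # placeholder instance type

ec2 = boto3.client("ec2", region_name=REGION)

# Discover the Availability Zones currently available in the region.
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
]

# Launch one instance per zone.
for zone in zones:
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )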
Fine-Tuning Auto-Scaling Rules
Applications that can automatically scale the number of server instances offer flexibility
and a great opportunity for optimization. For example, you could have an auto-scaling rule
that spawns a new instance once CPU utilization reaches 80 percent on all current instances
and another that spawns one once average CPU utilization reaches 50 percent.
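Rules like these are typically expressed as a scaling policy tied to a metric alarm. The
following Python sketch, using boto3, wires a CloudWatch alarm on average CPU utilization
to a simple scale-out policy on an existing Auto Scaling group; the group name, thresholds,
and evaluation periods are illustrative assumptions.

# Sketch: a threshold-based scale-out rule like the 80 percent example above.
# Assumes an existing Auto Scaling group; its name and the alarm parameters
# are placeholders.
import boto3

ASG_NAME = "web-tier-asg"   # placeholder Auto Scaling group name
REGION = "us-east-1"

autoscaling = boto3.client("autoscaling", region_name=REGION)
cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# A simple scaling policy that adds one instance each time it is triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out-on-high-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# A CloudWatch alarm that fires the policy once average CPU utilization across
# the group stays at or above 80 percent for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-scale-out",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)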
However, how do businesses know whether 80 percent and 50 percent are the right percentages?
There are two ways to determine the right thresholds. The first is trial and error: an
organization stress tests its application and determines the load at which the application's
response time starts lagging behind its usual response time or a noticeable delay appears.
The second is to calculate the maximum number of tasks, users, or processes an application
can handle simultaneously and convert that figure into a percentage of compute capacity.
Factors other than compute capacity can also be included, such as memory footprint, network
utilization, and disk utilization.
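A minimal arithmetic sketch of the second approach follows, assuming a per-instance limit
measured during a stress test and roughly linear CPU growth with load; all numbers are
illustrative assumptions.

# Sketch: convert a measured per-instance capacity limit into a CPU-utilization
# threshold for the scaling rule. The figures below are assumptions, not values
# from the text.

MAX_CONCURRENT_USERS = 400   # users one instance handled before lagging (assumed)
CPU_AT_MAX_LOAD = 95.0       # CPU % observed at that load during the stress test (assumed)
SAFETY_MARGIN = 0.8          # scale out before reaching the breaking point

# CPU percentage consumed per concurrent user, assuming roughly linear scaling.
cpu_per_user = CPU_AT_MAX_LOAD / MAX_CONCURRENT_USERS

# Threshold at which to add capacity: 80 percent of the observed breaking point.
scale_out_threshold = CPU_AT_MAX_LOAD * SAFETY_MARGIN
target_users = scale_out_threshold / cpu_per_user

print(f"Scale out at {scale_out_threshold:.0f}% CPU "
      f"(~{target_users:.0f} concurrent users per instance)")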
Even so, you may still need to experiment with different combinations to get the thresholds
exactly right before you can achieve considerable optimization.