Amazon EC2 Outage
On June 29, 2012, Internet users experienced firsthand the downtime that
infrastructure vulnerability can bring. By 11:21 p.m. EST, a large portion of the Internet was
offline because of massive power outages caused by a severe thunderstorm in Northern
Virginia, where one of Amazon's biggest EC2 data centers is located. The outage affected only a
single availability zone, but that zone hosted widely popular media and social media
sites, including Netflix, Instagram, and Pinterest. Other sites were affected as well,
but the outage was immediately sensationalized because of the popularity of the
aforementioned sites. Amazon reported that it had resolved the majority of the issues by 1:15 a.m., and
all affected sites were brought back up a few hours later. Overall, the outage lasted less
than 12 hours, but the amount of attention it garnered was staggering.
Because Amazon's cloud services reach every corner of the Internet, everyone feels it
when they fail. But what about the promise of no downtime with cloud computing? Well, that
promise is not automatic: it must be explicitly applied to the uploaded application
or website by the application or website provider. Amazon makes it easy for users to run
their AWS workloads across availability zones and provides various redundancy measures.
This simply means that the affected websites did not apply an important feature of cloud
computing, and it shows that users are still not using the cloud to its full potential.
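The effect of spreading a workload across availability zones can be illustrated with a small sketch. It is a simplified model, not Amazon's actual placement logic: the zone names, the replica counts, and the helper functions `place_replicas` and `surviving_replicas` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import cycle


def place_replicas(replica_count, zones):
    """Assign replica ids to zones round-robin; returns {zone: [replica ids]}."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = defaultdict(list)
    for replica_id, zone in zip(range(replica_count), cycle(zones)):
        placement[zone].append(replica_id)
    return dict(placement)


def surviving_replicas(placement, failed_zone):
    """Return the replica ids still running after one zone goes offline."""
    return sorted(
        replica_id
        for zone, replicas in placement.items()
        if zone != failed_zone
        for replica_id in replicas
    )


# Hypothetical zone names; a real deployment would use the provider's zone ids.
single_zone = place_replicas(4, ["zone-a"])
multi_zone = place_replicas(4, ["zone-a", "zone-b", "zone-c"])

print(surviving_replicas(single_zone, "zone-a"))  # [] : the whole site is down
print(surviving_replicas(multi_zone, "zone-a"))   # [1, 2] : the site stays up
```

In this model, a site deployed to a single zone loses every replica when that zone fails, while the same site spread over three zones keeps serving from the remaining two.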
Outages Due to Environmental Reasons
Environmental issues are among the leading causes of IT equipment failures. In fact, of the root
causes of downtime found by the Emerson study, 15 percent can be
attributed to an environmental variable such as thermal issues or water incursion. Detection
of and recovery from these failures also incurred significant costs, averaging more than
$489,000 per incident. And when these environmental issues caused actual equipment failure,
they resulted in the highest overall cost, at more than $750,000, because expensive equipment
had to be replaced in addition to the cost of manpower and the further downtime associated
with the procedure.
The problem with environmental issues is that they can cause a chain reaction of IT
equipment failures, which requires extensive effort to detect and recover from
the issue that caused the outage, not to mention replacing the equipment. It is
worrisome that cooling equipment does not even need to fail to cause an IT equipment failure;
this points to a deeper problem within the cooling infrastructure itself. These isolated
failures, which are typically caused by hotspots within the racks themselves, are often the
result of inadequacies in the cooling infrastructure rather than of a cooling equipment failure.
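Because hotspots can develop while the cooling units themselves report no fault, monitoring temperatures at rack level is one way to catch them. The sketch below is a minimal illustration under stated assumptions: the rack names, the sample readings, the 27 °C threshold, and the helper `find_hotspots` are hypothetical and not taken from the Emerson study or any particular monitoring product.

```python
def find_hotspots(inlet_temps_c, threshold_c=27.0):
    """Return rack ids whose inlet temperature exceeds the threshold, hottest first."""
    hot = [(rack, temp) for rack, temp in inlet_temps_c.items() if temp > threshold_c]
    hot.sort(key=lambda item: item[1], reverse=True)
    return [rack for rack, _ in hot]


# Hypothetical per-rack inlet temperatures in degrees Celsius.
readings = {"rack-01": 22.5, "rack-02": 31.0, "rack-03": 24.0, "rack-04": 28.5}
cooling_units_ok = True  # assumed status: no cooling equipment has failed

hotspots = find_hotspots(readings)
if cooling_units_ok and hotspots:
    # Healthy cooling units plus hot racks suggests an airflow or capacity
    # inadequacy rather than an equipment failure.
    print("Cooling units report healthy, but hotspots found:", hotspots)
```

With the sample readings, the check flags rack-02 and rack-04 even though no cooling unit has failed, which is the situation described above.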