Determining a failure domain is done within a particular scope, or set of assumptions, about
how large an outage we are willing to consider. For example, we may keep off-site backups
1000 miles away, assuming that an outage that affects two buildings that far apart is an
acceptable risk, or that a disaster that large would mean we'd have other problems to worry
about.
Unaligned Failure Domains Increase Outage Impact
A company with many large datacenters used an architecture in which a power bus
was shared by every group of six racks. A network subsystem provided network
connectivity for every eight racks. The network subsystem received power from
the first rack of each of its groups.
If a power bus needed to be turned off for maintenance, the resulting outage would
involve the six racks directly attached to it for power. In addition, other racks
would lose network connectivity if they were unlucky enough to be on a network
subsystem that got power from an affected rack. This extended the failure domain
to as many as 13 racks. Many users felt it was unfair that they were suffering even
though the repair didn't directly affect them.
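The rack arithmetic can be checked with a small sketch. It assumes racks are numbered along a single row, that network groups are contiguous blocks of eight starting at rack 0, and that each network subsystem is powered only from the first rack of its block. The function name `affected_racks` and the 48-rack row are hypothetical, chosen for illustration, and are not taken from the design described above.

```python
def affected_racks(bus_start, total_racks, bus_size=6, net_size=8):
    """Return the set of racks hit when one power bus is switched off.

    Assumptions (illustrative only): the bus feeds the contiguous racks
    bus_start .. bus_start + bus_size - 1, network groups are contiguous
    blocks of net_size racks starting at rack 0, and each network
    subsystem draws power from the first rack of its block.
    """
    # Racks that lose power directly.
    unpowered = set(range(bus_start, min(bus_start + bus_size, total_racks)))
    affected = set(unpowered)
    # Any network group whose first rack lost power takes its whole block offline.
    for first in range(0, total_racks, net_size):
        if first in unpowered:
            affected.update(range(first, min(first + net_size, total_racks)))
    return affected


# Worst case: the last rack on the bus is the first rack of a network group,
# so 5 + 8 = 13 racks are affected.
print(len(affected_racks(3, 48)))
# When both domains are the same size and aligned, the outage stays at one group.
print(len(affected_racks(8, 48, bus_size=8, net_size=8)))
```

Under these assumptions the worst case is six racks without power plus seven more that only lose networking, which matches the figure of 13.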
There were additional unaligned failure domains related to cooling and to which
machines were managed by which cluster manager. As a result, these misalignments
were not just an inconvenience to some but a major factor in reduced
system availability.
Eventually a new datacenter design was created that aligned all physical failure
domains to a common multiple. In some cases, this meant working with vendors
to create custom designs. Old datacenters were eventually retrofitted to the new
design at great expense.
We commonly hear of datacenters that have perfect alignment of power, networking,
and other factors but in which an unexpected misalignment results in a major outage. For
example, consider a datacenter with 10 domains, each independently powered, cooled, and
networked. Suppose the building has two connections to the outside world and the related
equipment is located based on where the connections come into the building. If cooling
fails in the two domains that include those connections, suddenly all 10 domains have no
connectivity to the outside world.
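The hidden dependency in this example can be modeled in a few lines. This is a minimal sketch, assuming that a domain with failed cooling loses everything it hosts, that the internal network between healthy domains keeps working, and that any surviving outside connection can serve every healthy domain. The function name `domains_with_outside_access` and the choice of domains 0 and 1 as hosts of the connections are hypothetical.

```python
def domains_with_outside_access(num_domains, uplink_domains, failed_domains):
    """Return the domains that can still reach the outside world.

    Assumptions (illustrative only): a failed domain loses everything it
    hosts, including any outside connection located there, and a single
    surviving outside connection is enough for all healthy domains.
    """
    failed = set(failed_domains)
    working_uplinks = set(uplink_domains) - failed
    if not working_uplinks:
        # Every outside connection sat in a failed domain.
        return set()
    return set(range(num_domains)) - failed


# Cooling fails in the two domains that host the outside connections:
# no domain has outside connectivity, although eight are otherwise healthy.
print(domains_with_outside_access(10, {0, 1}, {0, 1}))
# Cooling fails in two other domains: the remaining eight keep outside access.
print(sorted(domains_with_outside_access(10, {0, 1}, {5, 6})))
```

The same two-domain failure has a very different impact depending on whether the failed domains host the shared outside connections, which is the misalignment the example describes.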