• Identifying the root cause is not intended to apportion blame, but rather to learn
how to improve the system and operational practices.
• Building reliable software on top of unreliable components means that resiliency
features are expected, not an undue burden, a “nice to have” feature, or an
extravagance.
• At cloud scale, complex failures are inevitable and unpredictable.
• In production, complex systems often interact in ways that aren't explicitly known
at first (timeouts, resource contention, handoffs).
Ideally we'd like perfect systems that have perfect uptime. Sadly, such systems don't exist
outside of sales presentations. Until such systems do exist, we'd rather have enough
failures to ensure confidence in the precautionary measures we put in place. Failover
mechanisms need to be exercised whether they are automatic or manual. If they are automatic,
the more time that passes without the mechanism being activated, the less confident we
can be that it will work properly. The system may have changed in ways that are
unexpectedly incompatible and break the failover mechanism. If the failover mechanism is a
manual procedure, we not only lose confidence in the procedure, but we also lose confidence in
the team's ability to do the procedure. In other words, the team gets out of practice or the
knowledge becomes concentrated among a certain few. Ideally, we want services that fail
often enough to maintain confidence in the failover procedure but not often enough to be
detrimental to the service itself. Therefore, if a component is too perfect, it is better to
artificially cause a failure to reestablish confidence.
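
One way to act on this idea is to track when each failover mechanism was last exercised and flag the ones that have gone too long without activation. The following is a minimal sketch, not a prescribed implementation: the FailoverMechanism record, the drills_due function, the example mechanism names, and the 90-day threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FailoverMechanism:
    """Hypothetical record of one failover mechanism (illustrative only)."""
    name: str
    automatic: bool                      # False means a manual procedure
    last_exercised: Optional[datetime]   # None means it has never been exercised


def drills_due(
    mechanisms: List[FailoverMechanism],
    now: datetime,
    max_idle: timedelta = timedelta(days=90),  # assumed threshold, tune per service
) -> List[str]:
    """Return names of mechanisms idle longer than max_idle, stalest first."""
    stale = [
        m for m in mechanisms
        if m.last_exercised is None or now - m.last_exercised > max_idle
    ]
    # Never-exercised mechanisms sort ahead of everything else.
    stale.sort(key=lambda m: m.last_exercised or datetime.min)
    return [m.name for m in stale]


# Example inventory (made-up names and dates).
inventory = [
    FailoverMechanism("db-replica-promotion", True, datetime(2024, 5, 20)),
    FailoverMechanism("dns-cutover-runbook", False, datetime(2023, 12, 1)),
    FailoverMechanism("cache-failover", True, None),
]
due = drills_due(inventory, now=datetime(2024, 6, 1))
```

Each name in the resulting list is a candidate for an artificially induced failure. For a manual procedure, a drill also gives the team practice and spreads the knowledge beyond a certain few.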