Disaster Preparedness - The Practice of Cloud System Administration

• Identifying the root cause is not intended to apportion blame, but rather to learn how to improve the system and operational practices.

• Building reliable software on top of unreliable components means that resiliency features are expected, not an undue burden, a “nice to have” feature, or an extravagance.

• At cloud scale, complex failures are inevitable and unpredictable.

• In production, complex systems often interact in ways that aren't explicitly known at first (timeouts, resource contention, handoffs).

Ideally we'd like perfect systems that have perfect uptime. Sadly, such systems don't exist outside of sales presentations. Until such systems do exist, we'd rather have enough failures to ensure confidence in the precautionary measures we put in place. Failover mechanisms need to be exercised whether they are automatic or manual. If they are automatic, the more time that passes without the mechanism being activated, the less confident we can be that it will work properly. The system may have changed in ways that are unexpectedly incompatible and break the failover mechanism. If the failover mechanism is a manual procedure, we not only lose confidence in the procedure, but we also lose confidence in the team's ability to do the procedure. In other words, the team gets out of practice or the knowledge becomes concentrated among a certain few. Ideally, we want services that fail often enough to maintain confidence in the failover procedure but not often enough to be detrimental to the service itself. Therefore, if a component is too perfect, it is better to artificially cause a failure to reestablish confidence.
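
One way to act on this idea is to track when each failover mechanism was last exercised and flag the ones that have been idle too long as candidates for a deliberately induced failure. The following is only a minimal sketch under stated assumptions: the `FailoverRecord` type, the function names, the example service names, and the 30-day threshold are all hypothetical illustrations, not something the text prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FailoverRecord:
    """When a failover mechanism was last exercised (hypothetical type)."""
    name: str
    last_exercised: Optional[datetime]  # None means it has never been exercised


def needs_drill(record: FailoverRecord, now: datetime,
                max_idle: timedelta = timedelta(days=30)) -> bool:
    """Return True if the mechanism has gone unexercised longer than max_idle.

    The 30-day default is an assumed value chosen only for illustration.
    """
    if record.last_exercised is None:
        return True
    return now - record.last_exercised > max_idle


def select_drills(records: List[FailoverRecord], now: datetime,
                  max_idle: timedelta = timedelta(days=30)) -> List[str]:
    """List the names of mechanisms that are due for an induced failure."""
    return [r.name for r in records if needs_drill(r, now, max_idle)]


# Example data; the names and dates are made up.
now = datetime(2024, 1, 31)
records = [
    FailoverRecord("db-replica", datetime(2024, 1, 25)),  # exercised 6 days ago
    FailoverRecord("lb-standby", datetime(2023, 11, 1)),  # idle for about 3 months
    FailoverRecord("dns-secondary", None),                # never exercised
]
due = select_drills(records, now)
```

Here `due` contains the two mechanisms that have not been activated recently. In practice the threshold would be tuned per service, so that drills happen often enough to keep confidence and practice up without harming the service itself.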
