Such interactions are exercised less often, but when they are needed it is because of a
larger disaster. These exercises are more likely to lead to customer-visible outages and
loss of revenue than team-based service testing or random testing. For this reason, teams
should be doing well at their own fire drills before getting involved in DiRT.
15.4.1 Getting Started
While DiRT exercises can be very large, it is best to introduce a company to the concept
by starting small. A small project is easier to justify to management. A working small
example is easier to approve than a hypothetical big project, which, in all honesty, sounds
pretty terrifying to someone who is unfamiliar with the concept. An executive is right to be
very concerned when the people who are supposed to keep the systems up start talking
about how much they want to take those systems down.
Starting small also means simpler tests. Google's DiRT started with only a few teams.
Tests were safe and were engineered so that they could not create any user-visible disruptions.
This was done even if it meant the tests weren't very useful. The approach got teams used to the concept,
reassured them that lessons learned would be used for constructive improvements to the
system, and let them know that failures would not become blame-fests or hunts for the
guilty. It also permitted the DiRT test coordinators to keep the processes simple and try out
their methodology and tracking systems.
Google's original Site Reliability Engineering (SRE) teams were located in the company's
headquarters in Mountain View, California. Eventually Google added SRE teams in Dublin,
Ireland, to handle oncall during the night in California. The first DiRT exercise simply tested
whether the Dublin SREs could run the service for an entire week without help. Mountain
View pretended to be unavailable. This exposed instances where Dublin SREs were depending
on the fact that if they had questions or issues, eventually a Mountain View SRE
would wake up and be able to assist.
Another of Google's early, simple tests was to try to work for one day without access
to the source code repository. This exercise found areas where production processes were
dependent on a system not intended to be always available. In fact, many tests involved
verifying that production systems did not rely on systems that were not built to be in the
critical path of production systems. The larger an organization grows, the more likely it is
that these dependencies will be found only through active testing.
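The idea of such a drill can be illustrated with a minimal sketch. It assumes a
hand-maintained list of production processes and the systems each one touches; every
name below (Process, run_drill, "source-repo", and so on) is hypothetical and is not taken
from Google's tooling. A model like this can only report dependencies that someone has
already written down. The real exercise is valuable precisely because cutting access
reveals the ones nobody recorded.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A production process and the systems it touches (hypothetical data)."""
    name: str
    needs: set = field(default_factory=set)


def run_drill(processes, unavailable):
    """Return the names of processes that would be blocked while the
    systems in `unavailable` are treated as down."""
    return sorted(p.name for p in processes if p.needs & unavailable)


# Illustrative inventory only; a real one would come from the teams involved.
processes = [
    Process("deploy-release", {"source-repo", "build-farm"}),
    Process("restart-frontend", {"config-store"}),
    Process("rotate-logs"),
]

# Drill: pretend the source code repository is unavailable for the day.
blocked = run_drill(processes, {"source-repo"})
print(blocked)
```

Any process reported as blocked by a system that was never meant to be always available is
a candidate for redesign, or at least for a documented workaround.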
15.4.2 Increasing Scope
Over time the tests can grow to include more teams. One can raise the bar for testing
objectives, including riskier tests, live tests, and the removal of low-value tests.