Case Study: Repeating Risky Behavior to Reduce Risk
At one company everyone knew that the last time a database needed to be failed
over, it didn't go well. Therefore the team feared doing the procedure and avoided
it, thinking this was good risk management. Instead, it actually increased risk. If
the process was needed in an emergency, it was unlikely to work. Realizing this
fact, the manager made the team fail the service over every week.
The first few times were rough, but the entire team watched and commented
as one person performed the database failover. The written procedure was updated
as comments were made. Pre-checks were documented. A team member pointed
out that before she did the failover, she verified that the disks had plenty of free
space because that had previously been a problem. The rest of the team didn't
know about this pre-check, but now it could be added to the written procedure and
everyone would know to do it.
More importantly, this realization raised a broader question: why wasn't the
amount of free disk space always being monitored? What would happen if an
emergency failover was needed and disk space was too low? A side project was
spawned to monitor the system's available disk space. Many similar issues were
discovered and fixed.
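
The disk space pre-check described above can be automated. The following is a minimal sketch, not the team's actual tooling: the 20 percent threshold, the function names, and the "/" path are all assumptions chosen for illustration. It uses Python's standard shutil.disk_usage call to read total and free bytes for each path.

```python
import shutil

# Hypothetical threshold: require at least 20% free space before a failover.
MIN_FREE_FRACTION = 0.20


def free_fraction(total_bytes, free_bytes):
    """Return the free share of a filesystem (0.0 if the total is not positive)."""
    if total_bytes <= 0:
        return 0.0
    return free_bytes / total_bytes


def failing_paths(paths, min_free=MIN_FREE_FRACTION, usage_fn=shutil.disk_usage):
    """Return the paths whose filesystems have less than min_free free space.

    usage_fn is injectable so the check can be exercised without real disks.
    """
    failures = []
    for path in paths:
        usage = usage_fn(path)  # object with .total, .used, .free in bytes
        if free_fraction(usage.total, usage.free) < min_free:
            failures.append(path)
    return failures


if __name__ == "__main__":
    # "/" is a stand-in; a real pre-check would list the database volumes.
    low = failing_paths(["/"])
    if low:
        print("Pre-check FAILED, low disk space on:", ", ".join(low))
    else:
        print("Pre-check passed: disk space is sufficient")
```

Running the same function on a schedule, rather than only before a failover, is one way to turn the pre-check into the continuous monitoring the team asked for.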
Eventually the process became more reliable, and the team's confidence grew. The
team had one less source of stress.
15.2 Individual Training: Wheel of Misfortune
There are many different ways a system can break. People oncall should be able to handle
the most common ones, but they need confidence in their ability to do so. We build
confidence by providing training in many different ways: documentation, mentoring, shadowing
more experienced people, and practice.
Wheel of Misfortune is a game that operational teams play to prepare people for oncall.
It is a way to improve an individual's knowledge of how to handle oncall tasks and to share
best practices. It can also be used to introduce new procedures to the entire team. This game
enables team members to maintain skills, learn new skills as needed, and learn from each
other. The game is played as follows.
The entire team meets in a conference room. Each round involves one person
volunteering to be the contestant and another volunteering to be the Master of Disaster (MoD).
The MoD explains an oncall situation and the contestant explains how they would work
the issue. The MoD acts as the system, responding to any actions taken by the contestant.