Case Study: Automated Repair Life Cycle
Google uses the Ganeti open source virtual cluster management system to run
many large clusters of physical machines, which in turn provide virtual machines
to thousands of users. A physical machine rarely fails, but because of the sheer
number of machines, hardware failures became quite frequent. As a result, SAs
spent a lot of time dealing with hardware issues. Tom became involved in a project
to automate the entire repair life cycle.
First, tools were developed to assist in common operations, all of which were
complex and error prone and required a high level of expertise:
Drain Tool: When monitoring detected signs of pending hardware problems
(such as correctable disk or RAM errors), all virtual machines would be migrated
to other physical machines.
Recovery Tool: When a physical machine unexpectedly died, this tool made
several attempts to power it off and on. If these efforts failed to recover the
machine, the virtual machines would be restarted from their last snapshot on
another physical machine.
Send to Repairs Tool: When a machine needed physical repairs, there was a
procedure for notifying the datacenter technicians about which machine had a
problem and what needed to be done. This tool gathered problem reports and
used the machine repair API to request the work. It included the serial number
of any failing disks, the memory slot of any failing RAM, and so on. In most
cases the repair technician was directed to the exact problem, reducing repair
time.
Re-assimilate Tool: When a machine came back from repairs, it needed to be
evaluated, configured, and re-added to the cluster.
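The Recovery Tool's logic described above (several power-cycle attempts, then a fallback to restarting the virtual machines elsewhere) can be sketched roughly as follows. This is only an illustration, not the actual tool: the function name, the three callbacks, and the default of three attempts are all hypothetical stand-ins for whatever power-control and cluster operations the real tool used.

```python
from typing import Callable


def recover_machine(
    power_cycle: Callable[[], None],            # hypothetical: power the machine off and on
    is_alive: Callable[[], bool],               # hypothetical: health check after a power cycle
    restart_vms_elsewhere: Callable[[], None],  # hypothetical: restart VMs from last snapshot
    max_attempts: int = 3,                      # assumed value; the source says only "several"
) -> str:
    """Power-cycle up to max_attempts times, then fall back to restarting the VMs."""
    for _ in range(max_attempts):
        power_cycle()
        if is_alive():
            return "recovered"
    # Every attempt failed: give up on this machine and restart its
    # virtual machines on another physical machine.
    restart_vms_elsewhere()
    return "vms_restarted_elsewhere"
```

Passing the operations in as callbacks keeps the retry policy separate from the hardware-specific details, which also makes the policy easy to test without touching a real machine.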
Each of these tools was improved over time. Soon the tools did their tasks better
than people could, with more error checking than a person would be likely to do.
Oncall duties involved simply running combinations of these tools.
Now the entire system could be fully automated by combining these tools. A
system was built that tracked the state of a machine (alive, having problems, in
repairs, being re-assimilated). It used the APIs of the monitoring system and the
repair status console to create triggers that activated the right tool at the right time.
As a result the oncall responsibilities were reduced from multiple alerts each day
to one or two alerts each week.
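One way to picture the state-tracking system is as a small state machine whose transitions trigger the tools. The sketch below is an assumption-laden illustration: only the four states and the four tool names come from the text, while the event names, the transition table, and the "alert_oncall" fallback are hypothetical.

```python
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    HAVING_PROBLEMS = "having problems"
    IN_REPAIRS = "in repairs"
    REASSIMILATING = "being re-assimilated"


# (current state, event) -> (tool to run, next state).
# Event names are hypothetical; in the system described they would be
# derived from the monitoring system and the repair status console.
TRANSITIONS = {
    (State.ALIVE, "pending_failure"): ("drain", State.HAVING_PROBLEMS),
    (State.ALIVE, "machine_died"): ("recovery", State.HAVING_PROBLEMS),
    (State.HAVING_PROBLEMS, "recovered"): (None, State.ALIVE),
    (State.HAVING_PROBLEMS, "needs_repair"): ("send_to_repairs", State.IN_REPAIRS),
    (State.IN_REPAIRS, "repair_done"): ("reassimilate", State.REASSIMILATING),
    (State.REASSIMILATING, "checks_passed"): (None, State.ALIVE),
}


def handle_event(state: State, event: str):
    """Return (tool, next_state) for an event.

    Any combination missing from the table leaves the state unchanged
    and asks for a human, which is where the remaining alerts would come from.
    """
    return TRANSITIONS.get((state, event), ("alert_oncall", state))
```

For example, `handle_event(State.ALIVE, "pending_failure")` selects the drain tool and moves the machine to the "having problems" state, while an event the table does not anticipate falls through to the oncall alert.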