Automation - The Practice of Cloud System Administration

Case Study: Automated Repair Life Cycle

Google uses the Ganeti open source virtual cluster management system to run

many large clusters of physical machines, which in turn provide virtual machines

to thousands of users. A physical machine rarely fails, but because of the sheer

number of machines, hardware failures became quite frequent. As a result, SAs

spent a lot of time dealing with hardware issues. Tom became involved in a project

to automate the entire repair life cycle.

First, tools were developed to assist in common operations, all of which were

complex, error prone, and dependent on a high level of expertise:

• Drain Tool: When monitoring detected signs of pending hardware problems

(such as correctable disk or RAM errors), all virtual machines would be migrated

to other physical machines.

• Recovery Tool: When a physical machine unexpectedly died, this tool made

several attempts to power it off and on. If these efforts failed to recover the

machine, the virtual machines would be restarted from their last snapshot on

another physical machine.

• Send to Repairs Tool: When a machine needed physical repairs, there was a

procedure for notifying the datacenter technicians about which machine had a

problem and what needed to be done. This tool gathered problem reports and

used the machine repair API to request the work. It included the serial number

of any failing disks, the memory slot of any failing RAM, and so on. In most

cases the repair technician was directed to the exact problem, reducing repair

time.

• Re-assimilate Tool: When a machine came back from repairs, it needed to be

evaluated, configured, and re-added to the cluster.
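
The Recovery Tool's retry-then-fall-back behavior can be sketched roughly as follows. This is only an illustration, not the actual tool: the function name, the three callbacks, and the default of three attempts are all hypothetical, since the text says only that "several attempts" were made.

```python
from typing import Callable


def recover_machine(
    power_cycle: Callable[[], None],
    is_alive: Callable[[], bool],
    restart_vms_elsewhere: Callable[[], None],
    max_attempts: int = 3,  # assumed value; the source says only "several attempts"
) -> str:
    """Power-cycle a dead machine a few times, then fall back to moving its VMs.

    All three callbacks are hypothetical stand-ins for whatever the real
    power-control, health-check, and cluster-management interfaces are.
    """
    for _ in range(max_attempts):
        power_cycle()
        if is_alive():
            return "recovered"
    # Every attempt failed: restart the VMs from their last snapshot
    # on another physical machine.
    restart_vms_elsewhere()
    return "vms_restarted_elsewhere"
```

Passing the actions in as callbacks keeps the retry logic separate from the hardware interfaces, so the same control flow can be exercised with fakes before it touches a real machine.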

Each of these tools was improved over time. Soon the tools did their tasks better

than people could, with more error checking than a person would be likely to do.

Oncall duties involved simply running combinations of these tools.

Now the entire system could be fully automated by combining these tools. A

system was built that tracked the state of a machine (alive, having problems, in

repairs, being re-assimilated). It used the APIs of the monitoring system and the

repair status console to create triggers that activated the right tool at the right time.

As a result the oncall responsibilities were reduced from multiple alerts each day

to one or two alerts each week.
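
A state tracker of this kind might look something like the sketch below. The four states come from the text; the event names, the transition table, and the MachineTracker class are assumptions made for illustration, and the real system's triggers were driven by the monitoring and repair status APIs rather than by string events.

```python
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    HAVING_PROBLEMS = "having problems"
    IN_REPAIRS = "in repairs"
    REASSIMILATING = "being re-assimilated"


# (current state, event) -> (tool to run or None, next state).
# Event names and transitions are hypothetical, chosen to match the four tools.
TRANSITIONS = {
    (State.ALIVE, "pending_failure"): ("drain", State.HAVING_PROBLEMS),
    (State.ALIVE, "machine_dead"): ("recovery", State.HAVING_PROBLEMS),
    (State.HAVING_PROBLEMS, "needs_repair"): ("send_to_repairs", State.IN_REPAIRS),
    (State.IN_REPAIRS, "repair_done"): ("reassimilate", State.REASSIMILATING),
    (State.REASSIMILATING, "checks_passed"): (None, State.ALIVE),
}


class MachineTracker:
    """Tracks each machine's state and runs the matching tool on each event."""

    def __init__(self, tools):
        self.tools = tools  # tool name -> callable taking a machine name
        self.states = {}    # machine name -> State (unknown machines count as ALIVE)

    def handle(self, machine, event):
        state = self.states.get(machine, State.ALIVE)
        transition = TRANSITIONS.get((state, event))
        if transition is None:
            # Event does not apply in this state; a real system might alert oncall here.
            return state
        tool, next_state = transition
        if tool is not None:
            self.tools[tool](machine)
        self.states[machine] = next_state
        return next_state
```

Keeping the transitions in a single table makes it easy to review which tool fires in which state, and anything not in the table is left for a human to handle.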
