Designing for Operations - The Practice of Cloud System Administration


Easy Configuration Does Not Require a GUI

A product manager from IBM once told Tom that the company had spent a lot

of money adding a graphical user interface (GUI) to a system administration tool.

This was done to make it easier to configure. To the team's dismay, the majority of

their customers did not use the GUI because they had written Perl scripts to generate

the configuration files.

2.1.2 Startup and Shutdown

The service should restart automatically when a machine boots up. If the machine is shut

down properly, the system should include the proper operating system (OS) hooks to shut

the service down properly. If the machine crashes suddenly, the next restart of the system

should automatically perform data validations or repairs before providing service.
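
The startup and shutdown behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical service that keeps a marker file while it runs: an orderly shutdown removes the marker, so a marker found at startup suggests the previous run crashed and data should be validated first. The class name, file name, and repair step are illustrative assumptions, not anything prescribed by the text.

```python
import os
import signal
import tempfile


class Service:
    """Toy service that records whether its last shutdown was orderly."""

    def __init__(self, state_dir):
        # Hypothetical marker file: present only while the service is running.
        self.marker = os.path.join(state_dir, "running.marker")
        self.repaired = False
        self.running = False

    def start(self):
        # A leftover marker means the previous run never shut down cleanly.
        if os.path.exists(self.marker):
            self.validate_and_repair()
        with open(self.marker, "w") as f:
            f.write(str(os.getpid()))
        self.running = True

    def validate_and_repair(self):
        # Placeholder: a real service would check and repair its data here.
        self.repaired = True

    def stop(self, signum=None, frame=None):
        # Orderly shutdown: remove the marker so the next start skips repair.
        if os.path.exists(self.marker):
            os.remove(self.marker)
        self.running = False

    def install_signal_handlers(self):
        # Assumption: the OS shutdown hook delivers SIGTERM to the process.
        signal.signal(signal.SIGTERM, self.stop)
```

A service started after `stop()` skips the repair step, while one started after a simulated crash (no `stop()` call) runs it before accepting work.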

Ensuring that a service restarts after a reboot can be as simple as installing a boot-time

script, or using a system that monitors processes and restarts them (such as Ubuntu Upstart).

Alternatively, it can be an entire process management system like Apache Mesos

(Metz 2013) or Google Omega (Schwarzkopf, Konwinski, Abd-El-Malek & Wilkes 2013),

which not only restarts a process when a machine reboots, but also is able to restart the

process on an entirely different machine in the event of machine death.
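
The simplest form of "a system that monitors processes and restarts them" can be sketched as a loop around a child process. This is only an illustration of the idea on a single machine, not how Upstart, Mesos, or Omega are implemented; the function name, restart limit, and delay are assumptions chosen for the example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_seconds=0.1):
    """Run cmd, restarting it each time it exits with a nonzero status.

    Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return restarts  # clean exit, nothing to restart
        if restarts >= max_restarts:
            return restarts  # give up rather than loop forever
        restarts += 1
        time.sleep(delay_seconds)  # brief pause to avoid a tight crash loop
```

For example, `supervise([sys.executable, "-c", "import sys; sys.exit(1)"], max_restarts=2)` runs the failing command three times in total and then gives up. A cluster-level manager would additionally decide which machine to restart the process on.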

The amount of time required to start up or shut down a system should be documented.

This is needed for preparing for disaster recovery situations. One needs to know how

quickly a system can be safely shut down to plan the battery capacity of uninterruptible

power supply (UPS) systems. Most UPS batteries can sustain a system for about five

minutes. After a power outage, starting up thousands of servers can be very complex.

Knowing expected startup times and procedures can dramatically reduce recovery time.
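
Once shutdown times are documented, checking them against UPS capacity is simple arithmetic. The sketch below assumes the five-minute (300-second) battery figure mentioned above and a hypothetical 60-second safety margin; the service names and times in the usage note are invented for illustration.

```python
def fits_ups_window(shutdown_seconds, ups_runtime_seconds=300, margin_seconds=60):
    """Return True if a shutdown finishes within the UPS runtime minus a margin."""
    usable_seconds = ups_runtime_seconds - margin_seconds
    return shutdown_seconds <= usable_seconds


def services_at_risk(documented_times, ups_runtime_seconds=300, margin_seconds=60):
    """Return the sorted names of services whose shutdown exceeds the usable window."""
    return sorted(
        name
        for name, seconds in documented_times.items()
        if not fits_ups_window(seconds, ups_runtime_seconds, margin_seconds)
    )
```

With hypothetical documented times such as `{"web": 30, "database": 270}`, `services_at_risk` reports only `database`, which would need either a faster shutdown path or more battery capacity.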

Testing how a system behaves when all systems lose power concurrently is important.

It's a common datacenter stressor. Thousands of hard disk motors spinning up at the

same time create a huge power draw that can overload power systems. In general, one can

expect 1 to 5 percent of machines to not boot on the first try. In a system with 1000 machines,

a large team of people might be required to resuscitate them all.
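
The 1 to 5 percent figure translates directly into a staffing estimate. The sketch below computes the expected range of machines needing manual attention. It also shows one possible mitigation for the simultaneous spin-up power draw, powering machines on in batches; that mitigation is an assumption of this example rather than something the text prescribes, and the function names are hypothetical.

```python
def expected_boot_failures(machine_count, low_percent=1, high_percent=5):
    """Return (low, high) counts of machines expected not to boot on the first try."""
    low = machine_count * low_percent // 100
    high = machine_count * high_percent // 100
    return low, high


def power_on_batches(hosts, batch_size):
    """Split hosts into consecutive batches so they are not all powered on at once."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]
```

For 1000 machines, `expected_boot_failures(1000)` gives a range of 10 to 50 machines that may need someone to intervene after a full power loss.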

Related to this is the concept of “crash-only” software. Candea & Fox ( 2003 ) observe

that the post-crash recovery procedure in most systems is critical to system reliability, yet

receives a disproportionately small amount of quality assurance (QA) testing. A service

that is expected to have high availability should rarely use the orderly shutdown process.

To align the importance of the recovery procedure with the amount of testing it should receive,

these authors propose not implementing the orderly shutdown procedure or the orderly

startup procedures. Thus, the only way to stop the software is to crash it, and the only
