Easy Configuration Does Not Require a GUI
A product manager from IBM once told Tom that the company had spent a lot
of money adding a graphical user interface (GUI) to a system administration tool.
This was done to make it easier to configure. To the team's dismay, the majority of
their customers did not use the GUI because they had written Perl scripts to gener-
ate the configuration files.
2.1.2 Startup and Shutdown
The service should restart automatically when a machine boots up. If the machine is shut
down properly, the system should include the proper operating system (OS) hooks to shut
the service down properly. If the machine crashes suddenly, the next restart of the system
should automatically perform data validations or repairs before providing service.
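One way to picture the three cases above (boot, orderly shutdown, crash) is a marker file that exists only while the service is running. The following is a minimal sketch, not a definitive implementation: the class name, the marker file name, and the placeholder repair step are all assumptions made for illustration.

```python
import os
import signal


class Service:
    """Toy service that can tell a clean shutdown from a crash."""

    def __init__(self, state_dir):
        # Hypothetical marker file: it exists only while the service runs.
        self.marker = os.path.join(state_dir, "running.marker")
        self.repaired = False

    def start(self):
        if os.path.exists(self.marker):
            # The marker survived, so the previous run never reached stop().
            self.validate_and_repair()
        with open(self.marker, "w") as f:
            f.write(str(os.getpid()))

    def validate_and_repair(self):
        # Placeholder: a real service would check and repair its data here.
        self.repaired = True

    def stop(self, signum=None, frame=None):
        # Orderly shutdown: a real service would flush its state first.
        if os.path.exists(self.marker):
            os.remove(self.marker)


def install_shutdown_hook(service):
    # SIGTERM is commonly sent by the OS during an orderly shutdown.
    signal.signal(signal.SIGTERM, service.stop)
```

If the process is killed before stop() runs, the marker stays on disk, and the next start() performs the validation step before the service begins work.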
Ensuring that a service restarts after a reboot can be as simple as installing a boot-time
script, or using a system that monitors processes and restarts them (such as Ubuntu Up-
start). Alternatively, it can be an entire process management system like Apache Mesos
(Metz 2013) or Google Omega (Schwarzkopf, Konwinski, Abd-El-Malek & Wilkes 2013),
which not only restarts a process when a machine reboots, but also is able to restart the
process on an entirely different machine in the event of machine death.
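The simplest form of "monitor a process and restart it" can be sketched in a few lines. This is an illustrative toy, not how Upstart, Mesos, or Omega work internally; the function name, the restart limit, and the delay are assumptions chosen to keep the example bounded.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, delay=0.1):
    """Run cmd, restarting it each time it exits with a nonzero status.

    Returns the number of times the command was started. Raises
    RuntimeError once the restart limit is used up.
    """
    starts = 0
    while True:
        starts += 1
        result = subprocess.run(cmd)
        if result.returncode == 0:
            # Clean exit: nothing left to supervise.
            return starts
        if starts > max_restarts:
            raise RuntimeError(f"{cmd!r} still failing after {max_restarts} restarts")
        # Brief pause so a crash loop does not spin the CPU.
        time.sleep(delay)
```

A supervisor like this only helps while its own machine is up. Restarting the process on a different machine, as the larger systems do, needs a scheduler that runs outside the failed host.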
The amount of time required to start up or shut down a system should be documented.
This is needed for preparing for disaster recovery situations. One needs to know how
quickly a system can be safely shut down to plan the battery capacity of uninterruptible
power supply (UPS) systems. Most UPS batteries can sustain a system for about five
minutes. After a power outage, starting up thousands of servers can be very complex.
Knowing expected startup times and procedures can dramatically reduce recovery time.
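Documented shutdown times can be checked against the battery budget with simple arithmetic. The sketch below assumes that all services shut down in parallel, so the slowest one sets the total, and it reserves a safety margin. The service names, the times, and the 60-second margin are hypothetical; only the five-minute runtime comes from the text.

```python
UPS_RUNTIME_SECONDS = 5 * 60  # "about five minutes", per the text


def fits_on_battery(shutdown_seconds, runtime=UPS_RUNTIME_SECONDS, margin_seconds=60):
    """Return True if the slowest documented shutdown fits in the battery budget.

    shutdown_seconds maps a service name to its documented shutdown time.
    Assumes the services shut down in parallel.
    """
    budget = runtime - margin_seconds
    return max(shutdown_seconds.values()) <= budget


# Hypothetical documented shutdown times, in seconds.
documented = {"database": 180, "web": 20}
print(fits_on_battery(documented))
```

If shutdowns run one after another instead, the sum of the times is the figure to compare against the budget.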
Testing for how a system behaves when all systems lose power concurrently is import-
ant. It's a common datacenter stressor. Thousands of hard disk motors spinning up at the
same time create a huge power draw that can overload power systems. In general, one can
expect 1 to 5 percent of machines to not boot on the first try. In a system with 1000 ma-
chines, a large team of people might be required to resuscitate them all.
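To make the scale concrete, the 1 to 5 percent figure can be turned into a machine count. This is a small worked calculation; the function name is an assumption and the rates are the ones quoted above.

```python
import math


def expected_boot_failures(machines, low_pct=1, high_pct=5):
    """Return the (low, high) count of machines expected to fail the first boot."""
    low = math.ceil(machines * low_pct / 100)
    high = math.ceil(machines * high_pct / 100)
    return low, high


low, high = expected_boot_failures(1000)
print(f"Expect between {low} and {high} machines to need manual attention")
```

For 1000 machines this gives a range of 10 to 50 machines that each need someone to intervene.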
Related to this is the concept of “crash-only” software. Candea & Fox ( 2003 ) observe
that the post-crash recovery procedure in most systems is critical to system reliability, yet
receives a disproportionately small amount of quality assurance (QA) testing. A service
that is expected to have high availability should rarely use the orderly shutdown process.
To align the importance of the recovery procedure with the amount of testing it should
receive, these authors propose not implementing the orderly shutdown procedure or the
orderly startup procedures. Thus, the only way to stop the software is to crash it, and the only