Design Patterns for Resiliency - The Practice of Cloud System Administration

Another technique for dealing with software hangs is called a watchdog timer. A hardware

clock keeps incrementing a counter. If the counter exceeds a certain value, a hardware

subsystem will detect this and reboot the system. Software running on the system resets the

counter to zero after any successful operation. If the software hangs, the resets will stop

and soon the system will be rebooted. As long as the software keeps running, the counter

will be reset frequently enough to prevent a reboot.
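
The counter-and-reset logic described above can be sketched in a few lines of Python. This is a software simulation for illustration only, not a hardware interface: the class name `WatchdogTimer`, the `limit` parameter, and the `on_expire` callback (which stands in for the hardware reboot) are all assumptions made for the example.

```python
class WatchdogTimer:
    """Simulated watchdog: a counter that triggers an action if never reset."""

    def __init__(self, limit, on_expire):
        self.limit = limit          # highest counter value tolerated
        self.on_expire = on_expire  # stand-in for "reboot the system"
        self.counter = 0
        self.expired = False

    def tick(self):
        """Called by the clock: increment the counter and check the limit."""
        if self.expired:
            return
        self.counter += 1
        if self.counter > self.limit:
            self.expired = True
            self.on_expire()

    def reset(self):
        """Called by the software after each successful operation."""
        self.counter = 0
```

As long as `reset()` is called more often than every `limit` ticks, `on_expire` never runs. If the software hangs and the calls to `reset()` stop, the callback fires once the counter passes the limit.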

A watchdog timer is most commonly used with operating system kernels and embedded

systems. Enabling the Linux kernel watchdog timer on a system with appropriate hardware

can be used to reduce the need to physically visit a machine when the kernel hangs or to

avoid the need to purchase expensive remote power control systems.
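
On Linux, the watchdog is typically exposed to user space as a device file, usually `/dev/watchdog`: opening it arms the timer, and each write postpones the reboot. The sketch below is a minimal illustration under those assumptions, not a production daemon. The function name `feed_watchdog`, the `healthy` callback, and the `max_feeds` and `sleep` parameters are hypothetical names introduced for the example. On drivers that support the "magic close" feature, writing the character `V` just before closing disarms the timer.

```python
import time


def feed_watchdog(dev, healthy, interval_s=1.0, max_feeds=None, sleep=time.sleep):
    """Write to the watchdog device for as long as healthy() returns True.

    dev is a writable binary file object. If healthy() returns False, the
    function returns without disarming, so the hardware timer will expire.
    """
    feeds = 0
    while healthy():
        dev.write(b"\0")   # any write postpones the reboot
        dev.flush()
        feeds += 1
        if max_feeds is not None and feeds >= max_feeds:
            dev.write(b"V")  # "magic close": ask the driver to disarm
            dev.flush()
            break
        sleep(interval_s)
    return feeds


# Hypothetical usage on a machine with watchdog hardware (requires root):
#   with open("/dev/watchdog", "wb", buffering=0) as dev:
#       feed_watchdog(dev, healthy=lambda: True)
```

The feeding interval must be shorter than the timeout configured for the device, and the `healthy` check should exercise whatever "successful operation" means for the system being protected.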

Like crashes, hangs should be logged and analyzed. Frequent hangs are an indication of

hardware issues, locking problems, and other bugs that should be fixed before they become

big problems.

6.5.3 Query of Death

Sometimes a particular API call or query exercises an untested code path that causes a

crash, a long delay, or an infinite loop. We call such a query a query of death because it

kills the service.

When users discover a query of death for a popular web site, they let all of their friends

know. Soon much of the internet will also be trying it to see what a crashing web site looks

like. The better known your company is, the faster word will spread.

The best fix is to eliminate the bug that causes the problem. Unfortunately, it can take a

long time to fix the code and push a new release. A quick fix is needed in the meantime.

A widely used strategy is to have a banned query list that is easy to update and

communicate to all the frontends. The frontends automatically reject any query that is found on

the banned query list.
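
A banned query list can be as simple as a set that each frontend consults before doing any work. The following is a minimal sketch, assuming queries are plain strings and that two queries differing only in whitespace should be treated as the same. The names `BannedQueryList`, `replace_all`, and `handle_query`, and the status codes returned, are hypothetical and chosen for the example.

```python
class BannedQueryList:
    """In-memory set of queries that the frontend must refuse to run."""

    def __init__(self):
        self._banned = set()

    @staticmethod
    def _normalize(query):
        # Assumption: collapse whitespace so trivial variations still match.
        return " ".join(query.split())

    def ban(self, query):
        self._banned.add(self._normalize(query))

    def replace_all(self, queries):
        """Load a complete list pushed from a central source."""
        self._banned = {self._normalize(q) for q in queries}

    def is_banned(self, query):
        return self._normalize(query) in self._banned


def handle_query(query, banned, backend):
    """Reject banned queries up front; otherwise pass the query through."""
    if banned.is_banned(query):
        return (403, "query rejected")
    return (200, backend(query))
```

The important property is that the check happens in the frontend, before the query reaches any code path that could crash. How the list is distributed to every frontend is left out of this sketch.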

However, that solution still requires human intervention. A more automated mechanism

is required, especially when a query has a large fan-out. For example, suppose the query is

received and then sent to 1000 other servers, each one holding 1/1000th of the database. A

query of death would kill 1000 servers along with all the other queries that are in flight.

Dean and Barroso (2013) describe a preventive measure pioneered at Google called

canary requests. In situations where one would normally send the same request to

thousands of leaf servers, systems using this approach send the query to one or two leaf servers.

These are the canary requests. Queries are sent to the remaining servers only if replies to

the canary requests are received in a reasonable period of time. If the leaf servers crash or

hang while the canary requests are being processed, the system flags the request as poten-
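
The canary-request flow can be sketched as follows. This is an illustration under stated assumptions, not the implementation Dean and Barroso describe: `send` is a hypothetical caller-supplied function that contacts one leaf and raises an exception if the leaf crashes or does not reply within the timeout, and the names `fan_out_with_canaries` and `SuspectQueryError` are invented for the example. Requests are issued sequentially to keep the sketch short; a real system would send them in parallel.

```python
class SuspectQueryError(Exception):
    """Raised when a canary request fails, so the full fan-out is skipped."""


def fan_out_with_canaries(query, leaves, send, canary_count=2,
                          canary_timeout_s=1.0):
    """Send query to a few canary leaves first, then to the rest.

    send(leaf, query, timeout_s) must return the leaf's reply or raise
    an exception if the leaf crashes or fails to answer in time.
    """
    canaries, rest = leaves[:canary_count], leaves[canary_count:]
    results = []
    for leaf in canaries:
        try:
            results.append(send(leaf, query, canary_timeout_s))
        except Exception as exc:
            # A canary failed: flag the query and spare the other leaves.
            raise SuspectQueryError(f"canary failed on {leaf}") from exc
    # Canaries replied in time, so the remaining leaves get the query.
    for leaf in rest:
        results.append(send(leaf, query, None))
    return results
```

With 1000 leaves, a query of death in this sketch takes down at most `canary_count` servers instead of all 1000. The cost is extra latency, since the full fan-out waits for the canary replies.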
