Another technique for dealing with software hangs is called a watchdog timer. A hardware
clock keeps incrementing a counter. If the counter exceeds a certain value, a hardware
subsystem will detect this and reboot the system. Software running on the system resets the
counter to zero after any successful operation. If the software hangs, the resets will stop
and soon the system will be rebooted. As long as the software keeps running, the counter
will be reset frequently enough to prevent a reboot.
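The counter-and-reset behavior can be modeled in a few lines of Python. This is only a software sketch of the idea, not a driver for real watchdog hardware: the class name WatchdogTimer, the limit of 3 ticks, and the simulate helper are all hypothetical, and the "reboot" is represented by recording the tick at which the timer fired.

```python
from typing import List


class WatchdogTimer:
    """Software model of a hardware watchdog: a clock increments a counter
    that healthy software must keep resetting to zero."""

    def __init__(self, limit: int) -> None:
        self.limit = limit    # hypothetical threshold, measured in clock ticks
        self.counter = 0
        self.expired = False

    def tick(self) -> bool:
        """Called by the (simulated) hardware clock.

        Returns True exactly once, on the tick where the counter first
        exceeds the limit. Real hardware would reboot the system here.
        """
        if self.expired:
            return False
        self.counter += 1
        if self.counter > self.limit:
            self.expired = True
            return True
        return False

    def reset(self) -> None:
        """Called by the software after each successful operation."""
        self.counter = 0


def simulate(total_ticks: int, hang_at: int, limit: int = 3) -> List[int]:
    """Run the clock for total_ticks. The software resets the counter on
    every tick before hang_at and then hangs. Returns the ticks at which
    a reboot would have been triggered."""
    dog = WatchdogTimer(limit)
    reboots: List[int] = []
    for now in range(1, total_ticks + 1):
        if now < hang_at:      # software is still making progress
            dog.reset()
        if dog.tick():         # the hardware clock advances regardless
            reboots.append(now)
    return reboots
```

In this model, software that hangs at tick 5 triggers a reboot a few ticks later, once the counter climbs past the limit, while software that never hangs never triggers one. Choosing the limit is a trade-off: too low and a slow but healthy operation causes a spurious reboot, too high and a hung system stays down longer.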
A watchdog timer is most commonly used with operating system kernels and embedded
systems. Enabling the Linux kernel watchdog timer on a system with appropriate hardware
can be used to reduce the need to physically visit a machine when the kernel hangs or to
avoid the need to purchase expensive remote power control systems.
Like crashes, hangs should be logged and analyzed. Frequent hangs are an indication of
hardware issues, locking problems, and other bugs that should be fixed before they become
big problems.
6.5.3 Query of Death
Sometimes a particular API call or query exercises an untested code path that causes a
crash, a long delay, or an infinite loop. We call such a query a query of death because it
kills the service.
When users discover a query of death for a popular web site, they let all of their friends
know. Soon much of the internet will also be trying it to see what a crashing web site looks
like. The better known your company is, the faster word will spread.
The best fix is to eliminate the bug that causes the problem. Unfortunately, it can take a
long time to fix the code and push a new release. A quick fix is needed in the meantime.
A widely used strategy is to have a banned query list that is easy to update and
communicate to all the frontends. The frontends automatically reject any query that is
found on the banned query list.
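A minimal sketch of such a frontend check follows. Everything named here is an assumption made for illustration: the Frontend class, the update_banned method, the 403 status code, and the normalize helper, which lowercases the query and collapses whitespace so that trivial variations of a banned query are also caught. How a real system distributes the list to its frontends is not specified in the text and is left out.

```python
from typing import Callable, Iterable, Set, Tuple


def normalize(query: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace so
    that trivially different spellings map to the same banned entry."""
    return " ".join(query.lower().split())


class Frontend:
    """Illustrative frontend that consults a banned query list before
    forwarding anything to the backend."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend
        self.banned: Set[str] = set()

    def update_banned(self, queries: Iterable[str]) -> None:
        """Replace the whole list at once, so an update pushed to the
        frontends is a single cheap operation."""
        self.banned = {normalize(q) for q in queries}

    def handle(self, query: str) -> Tuple[int, str]:
        """Reject banned queries; forward everything else."""
        if normalize(query) in self.banned:
            return (403, "query rejected")
        return (200, self.backend(query))
```

A frontend built this way keeps serving ordinary traffic while the banned query is turned away before it can reach the code path that crashes:

```python
fe = Frontend(lambda q: "result for " + q)
fe.update_banned(["SELECT  crash_me"])
fe.handle("select crash_me ")   # rejected with status 403
fe.handle("select 1")           # forwarded to the backend
```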
However, that solution still requires human intervention. A more automated mechanism
is required, especially when a query has a large fan-out. For example, suppose the query is
received and then sent to 1000 other servers, each one holding 1/1000th of the database. A
query of death would kill 1000 servers along with all the other queries that are in flight.
Dean and Barroso (2013) describe a preventive measure pioneered at Google called
canary requests. In situations where one would normally send the same request to
thousands of leaf servers, systems using this approach send the query to one or two leaf servers.
These are the canary requests. Queries are sent to the remaining servers only if replies to
the canary requests are received in a reasonable period of time. If the leaf servers crash or
hang while the canary requests are being processed, the system flags the request as poten-