It is somewhat reasonable to consider recovering from a deadlock in the case of a process dying
unexpectedly. In other deadlock situations, where threads are waiting for each other, you really
shouldn't be looking at recovery techniques. You should be looking at your coding techniques.
System V-shared semaphores do make provision for recovery, and they may prove to be the
solution to your problem. They provide room for a system-maintained "undo" structure, which
will be invoked should the owner process die, and they can be reset by any process with
permission. They are expensive to use, though, and add complexity to your code.
Both Win32 mutexes and UI robust mutexes also have built-in "death detection," so that your program
can find out that the mutex it was waiting for was held by a newly dead thread.
Still, just having undo structures that can reset mutexes does not solve the real problem. The
data protected may be inconsistent, and this is what you have to deal with. It is possible to build
arbitrarily complex undo structures for your code, but it is a significant task that should not be
undertaken lightly.
Database systems do this routinely via two-phase commit strategies, as they have severe
restrictions on crash recovery. Essentially, what they do is (1) build a time-stamped structure
containing what the database will look like at the completion of the change; (2) save that structure
to disk and begin the change; (3) complete the change; (4) update the time stamp on the database;
and (5) delete the structure. A crash at any point in this sequence of events can be recovered from.
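The five steps above can be sketched in a few lines of Java. This is only an illustration, not how any particular database does it: the class IntentLogSketch, its file names, and the crashAfterSave flag are all hypothetical, and the sketch assumes that each file write either happens completely or not at all (a real system would need atomic, synced writes to guarantee that).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a one-value "database" updated via a saved intent record.
// Assumption: individual file writes are atomic; real systems must enforce this.
class IntentLogSketch {
    private final Path data;    // the database contents
    private final Path stamp;   // time stamp of the last completed change
    private final Path intent;  // saved description of the change in progress

    IntentLogSketch(Path dir) {
        data = dir.resolve("data.txt");
        stamp = dir.resolve("stamp.txt");
        intent = dir.resolve("intent.txt");
    }

    // Convenience factory for experiments: uses a fresh temporary directory.
    static IntentLogSketch inTempDir() {
        try {
            return new IntentLogSketch(Files.createTempDirectory("intent-sketch"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String readData() { return Files.exists(data) ? read(data) : ""; }

    long readStamp() { return Files.exists(stamp) ? Long.parseLong(read(stamp)) : 0L; }

    // crashAfterSave simulates a crash right after step (2).
    void update(String newValue, long newStamp, boolean crashAfterSave) {
        // Steps (1) and (2): build the time-stamped structure and save it to disk.
        write(intent, newStamp + "\n" + newValue);
        if (crashAfterSave) {
            return;
        }
        finish(newStamp, newValue);
    }

    // Run at startup: redo any change that was saved but not completed.
    void recover() {
        if (!Files.exists(intent)) {
            return;                        // no change was in progress
        }
        String[] parts = read(intent).split("\n", 2);
        long savedStamp = Long.parseLong(parts[0]);
        if (savedStamp > readStamp()) {
            finish(savedStamp, parts[1]);  // redo the change; repeating it is harmless
        } else {
            delete(intent);                // change had completed; only step (5) was missing
        }
    }

    private void finish(long newStamp, String newValue) {
        write(data, newValue);                  // step (3): complete the change
        write(stamp, Long.toString(newStamp));  // step (4): update the time stamp
        delete(intent);                         // step (5): delete the structure
    }

    private static void write(Path p, String s) {
        try {
            Files.write(p, s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String read(Path p) {
        try {
            return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void delete(Path p) {
        try {
            Files.deleteIfExists(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IntentLogSketch db = inTempDir();
        db.update("v1", 1L, false);
        db.update("v2", 2L, true);   // "crash" after the intent record is saved
        db.recover();                // what a restart would do
        System.out.println(db.readData() + " @ " + db.readStamp());
    }
}
```

The point of the design is that recovery never has to guess: if the saved structure carries a newer time stamp than the database, the change is redone from the saved copy; otherwise the change had already finished and the leftover structure is simply discarded.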
Java does not have anything similar to these recoverable mutexes, nor does it need them. Java
programs are either single process programs (in which case a deadlock is a programming bug) or
they use RMI or some other kind of remote method invocation (in which case the RMI package is
responsible for dealing with dead processes).
Be very, very careful when dealing with this problem!
The Lost Wakeup
If you simply neglect to hold the lock while testing or changing the value of the condition, your
program will be subject to the fearsome lost wakeup problem. This condition occurs when one of
your threads misses a wakeup signal because it had not yet gone to sleep. Of course, if you're not
protecting your shared data correctly, your program will be subject to numerous other bugs, so this
is nothing special. In Java it is not possible to suffer the lost wakeup problem just using
notify()/wait() directly because you must hold the lock before you can call notify().
However, you can create constructs in Java that will have this problem. The Mutex and
ConditionVar classes that we just built are subject to lost wakeup.
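As a small illustration of why plain wait()/notify() is protected, the sketch below shows the usual guarded-wait pattern and what happens if notify() is called without the lock. The class GuardedFlag and its method names are hypothetical and are not part of the book's example code.

```java
// Hypothetical sketch of the guarded-wait pattern with Java's built-in monitors.
class GuardedFlag {
    private boolean ready = false;   // the condition, always accessed under the lock

    // Waiter: the test and the sleep both happen while holding the lock,
    // so a notifier cannot slip in between them.
    synchronized void awaitReady() {
        while (!ready) {
            try {
                wait();              // releases the lock atomically while sleeping
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;              // give up waiting if interrupted
            }
        }
    }

    // Notifier: changes the condition and wakes waiters under the same lock.
    synchronized void setReady() {
        ready = true;
        notifyAll();
    }

    // Deliberately wrong: notify() without holding the lock.
    // Returns true if Java rejected the call.
    boolean notifyWithoutLock() {
        try {
            notify();
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        GuardedFlag flag = new GuardedFlag();
        System.out.println("rejected: " + flag.notifyWithoutLock());
        flag.setReady();
        flag.awaitReady();           // returns at once, the condition is already true
    }
}
```

Because the runtime throws IllegalMonitorStateException when notify() is called without the lock, the unlocked notifier in this sketch never gets as far as sending a wakeup.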
In Code Example 7-10 (slightly modified from our StopQueue example), it is possible for the
stopper (which has failed to use the lock) to decide that it's time to stop and broadcast right at the
instant between when the consumer checks the condition and when it goes to sleep. This code will
The probability that the stopper would get to run at exactly the right (er, wrong) instant is very
small. (In 1000 test runs of this code it did not occur once.) If we insert a slight delay in the
consumer between the test and the call that puts it to sleep, we can get it to happen. (In the
example code on the Web, the program LostWakeup allows you to vary the sleep time (delay)
to see how often it occurs on your machine.)
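The interleaving described above can also be forced rather than waited for. The following is a minimal sketch, not the LostWakeup program from the Web: the class LostWakeupSketch and all of its names are hypothetical, it uses a plain Java monitor in place of the Mutex and ConditionVar classes, and it swaps the tunable delay for two latches so that the stopper always runs in the gap between the consumer's test and its sleep. A timed wait is used only so that the demonstration terminates.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: forces the "test, then broadcast, then sleep" interleaving.
class LostWakeupSketch {
    private final Object sleepQueue = new Object();  // stands in for a condition variable
    private volatile boolean stop = false;           // the condition (not lock protected: the bug)
    private final CountDownLatch tested = new CountDownLatch(1);
    private final CountDownLatch broadcastDone = new CountDownLatch(1);

    // Buggy consumer: tests the condition, then goes to sleep, with a gap in between.
    // Returns true if the broadcast had already been sent when it went to sleep.
    boolean consumer(long timeoutMs) throws InterruptedException {
        boolean missed = false;
        if (!stop) {                         // the test says "keep waiting"
            tested.countDown();              // let the stopper run in the gap
            broadcastDone.await();           // plays the role of the inserted delay
            synchronized (sleepQueue) {
                missed = stop;               // wakeup was sent before we slept
                sleepQueue.wait(timeoutMs);  // no one will notify again; only the timeout ends this
            }
        }
        return missed;
    }

    // Stopper: changes the condition and broadcasts while the consumer is in the gap.
    void stopper() throws InterruptedException {
        tested.await();
        stop = true;
        synchronized (sleepQueue) {
            sleepQueue.notifyAll();          // nobody is asleep yet, so this wakes no one
        }
        broadcastDone.countDown();
    }

    static boolean demo() {
        LostWakeupSketch s = new LostWakeupSketch();
        Thread stopperThread = new Thread(() -> {
            try {
                s.stopper();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        stopperThread.start();
        try {
            boolean missed = s.consumer(200);
            stopperThread.join();
            return missed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("wakeup missed: " + demo());
    }
}
```

Without the timeout the consumer would sleep forever. The repair is the usual one: perform the test and the wait inside the same synchronized block that the stopper holds while it changes the condition and broadcasts.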
Example 7-10 The Lost Wakeup Problem