Searching - Data Structures and Algorithm Analysis

Java Reference

In-Depth Information

insertions and deletions. After the table is loaded with the initial collection of

records, the first few deletions will lengthen the average probe sequence distance

for records (it will add tombstones). Over time, the average distance will reach

an equilibrium point because insertions will tend to decrease the average distance

by filling in tombstone slots. For example, after initially loading records into the

database, the average path distance might be 1.2 (i.e., an average of 0.2 accesses

per search beyond the home position will be required). After a series of insertions

and deletions, this average distance might increase to 1.6 due to tombstones. This

seems like a small increase, but it is three times longer on average beyond the home

position than before deletions.

Two possible solutions to this problem are

1. Do a local reorganization upon deletion to try to shorten the average path

length. For example, after deleting a key, continue to follow the probe se-

quence of that key and swap records further down the probe sequence into

the slot of the recently deleted record (being careful not to remove any key

from its probe sequence). This will not work for all collision resolution poli-

cies.

2. Periodically rehash the table by reinserting all records into a new hash table.

Not only will this remove the tombstones, but it also provides an opportunity

to place the most frequently accessed records into their home positions.

9.5

Further Reading

For a comparison of the efficiencies for various self-organizing techniques, see

Bentley and McGeoch, “Amortized Analysis of Self-Organizing Sequential Search

Heuristics” [BM85]. The text compression example of Section 9.2 comes from

Bentley et al., “A Locally Adaptive Data Compression Scheme” [BSTW86]. For

more on Ziv-Lempel coding, see Data Compression: Methods and Theory by

James A. Storer [Sto88]. Knuth covers self-organizing lists and Zipf distributions

in Volume 3 of The Art of Computer Programming[Knu98].

Introduction to Modern Information Retrieval by Salton and McGill [SM83] is

an excellent source for more information about document retrieval techniques.

See the paper “Practical Minimal Perfect Hash Functions for Large Databases”

by Fox et al. [FHCD92] for an introduction and a good algorithm for perfect hash-

ing.

For further details on the analysis for various collision resolution policies, see

Knuth, Volume 3 [Knu98] and Concrete Mathematics: A Foundation for Computer

Science by Graham, Knuth, and Patashnik [GKP94].

The model of hashing presented in this chapter has been of a fixed-size hash

table. A problem not addressed is what to do when the hash table gets half full and

more records must be inserted. This is the domain of dynamic hashing methods.

Search WWH ::

Custom Search

Home