Java Reference
In-Depth Information
This method of computing sets from bit vectors is sometimes applied to doc-
ument retrieval. Consider the problem of picking from a collection of documents
those few which contain selected keywords. For each keyword, the document re-
trieval system stores a bit vector with one bit for each document. If the user wants to
know which documents contain a certain three keywords, the corresponding three
bit vectors are AND'ed together. Those bit positions resulting in a value of 1 cor-
respond to the desired documents. Alternatively, a bit vector can be stored for each
document to indicate those keywords appearing in the document. Such an organiza-
tion is called a signature file. The signatures can be manipulated to find documents
with desired combinations of keywords.
9.4
Hashing
This section presents a completely different approach to searching arrays: by direct
access based on key value. The process of finding a record using some computa-
tion to map its key value to a position in the array is called hashing. Most hash-
ing schemes place records in the array in whatever order satisfies the needs of the
address calculation, thus the records are not ordered by value or frequency. The
function that maps key values to positions is called a hash function and will be
denoted by h. The array that holds the records is called the hash table and will be
denoted by HT. A position in the hash table is also known as a slot. The number
of slots in hash table HT will be denoted by the variable M, with slots numbered
from 0 to M 1. The goal for a hashing system is to arrange things such that, for
any key value K and some hash function h, i = h(K) is a slot in the table such
that 0 h(K) < M, and we have the key of the record stored at HT[i] equal to
K.
Hashing is not good for applications where multiple records with the same key
value are permitted. Hashing is not a good method for answering range searches. In
other words, we cannot easily find all records (if any) whose key values fall within
a certain range. Nor can we easily find the record with the minimum or maximum
key value, or visit the records in key order. Hashing is most appropriate for answer-
ing the question, “What record, if any, has key value K?” For applications where
access involves only exact-match queries, hashing is usually the search method of
choice because it is extremely efficient when implemented correctly. As you will
see in this section, however, there are many approaches to hashing and it is easy
to devise an inefficient implementation. Hashing is suitable for both in-memory
and disk-based searching and is one of the two most widely used methods for or-
ganizing large databases stored on disk (the other is the B-tree, which is covered in
Chapter 10).
As a simple (though unrealistic) example of hashing, consider storing n records
each with a unique key value in the range 0 to n 1. In this simple case, a record
Search WWH ::




Custom Search