Searching - Data Structures and Algorithm Analysis

Java Reference

In-Depth Information

This method of computing sets from bit vectors is sometimes applied to doc-

ument retrieval. Consider the problem of picking from a collection of documents

those few which contain selected keywords. For each keyword, the document re-

trieval system stores a bit vector with one bit for each document. If the user wants to

know which documents contain a certain three keywords, the corresponding three

bit vectors are AND'ed together. Those bit positions resulting in a value of 1 cor-

respond to the desired documents. Alternatively, a bit vector can be stored for each

document to indicate those keywords appearing in the document. Such an organiza-

tion is called a signature file. The signatures can be manipulated to find documents

with desired combinations of keywords.

9.4

Hashing

This section presents a completely different approach to searching arrays: by direct

access based on key value. The process of finding a record using some computa-

tion to map its key value to a position in the array is called hashing. Most hash-

ing schemes place records in the array in whatever order satisfies the needs of the

address calculation, thus the records are not ordered by value or frequency. The

function that maps key values to positions is called a hash function and will be

denoted by h. The array that holds the records is called the hash table and will be

denoted by HT. A position in the hash table is also known as a slot. The number

of slots in hash table HT will be denoted by the variable M, with slots numbered

from 0 to M 1. The goal for a hashing system is to arrange things such that, for

any key value K and some hash function h, i = h(K) is a slot in the table such

that 0 h(K) < M, and we have the key of the record stored at HT[i] equal to

K.

Hashing is not good for applications where multiple records with the same key

value are permitted. Hashing is not a good method for answering range searches. In

other words, we cannot easily find all records (if any) whose key values fall within

a certain range. Nor can we easily find the record with the minimum or maximum

key value, or visit the records in key order. Hashing is most appropriate for answer-

ing the question, “What record, if any, has key value K?” For applications where

access involves only exact-match queries, hashing is usually the search method of

choice because it is extremely efficient when implemented correctly. As you will

see in this section, however, there are many approaches to hashing and it is easy

to devise an inefficient implementation. Hashing is suitable for both in-memory

and disk-based searching and is one of the two most widely used methods for or-

ganizing large databases stored on disk (the other is the B-tree, which is covered in

Chapter 10).

As a simple (though unrealistic) example of hashing, consider storing n records

each with a unique key value in the range 0 to n 1. In this simple case, a record

Search WWH ::

Custom Search

Home