Searching - Data Structures and Algorithm Analysis


birthday is similar to assigning records to slots in a table (of size 365) using the

birthday as a hash function. Note that this observation tells us nothing about which

students share a birthday, or on which days of the year shared birthdays fall.
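The birthday analogy can be made quantitative. The following is a minimal sketch, assuming every key is equally likely to hash to any of m slots (real birthdays are only approximately uniform). The class and method names are illustrative and do not come from the text. It computes the chance that at least two of n records land in the same slot.

```java
/** Chance of at least one collision when n keys are hashed uniformly into m slots. */
final class BirthdayCollision {
    static double collisionProbability(int n, int m) {
        if (n > m) {
            return 1.0; // more keys than slots: a collision is certain
        }
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th key must avoid the i slots already taken.
            allDistinct *= (double) (m - i) / m;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        // Table of size 365, as in the birthday analogy.
        for (int n : new int[] {10, 23, 50}) {
            System.out.printf("n=%d: %.3f%n", n, collisionProbability(n, 365));
        }
    }
}
```

Under the uniformity assumption, the probability passes one half at n = 23, even though the table would then be under 7 percent full.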

To be practical, a database organized by hashing must store records in a hash

table that is not so large that it wastes space. Typically, this means that the hash

table will be around half full. Because collisions are extremely likely to occur

under these conditions (by chance, any record inserted into a table that is half full

will have a collision half of the time), does this mean that we need not worry about

the ability of a hash function to avoid collisions? Absolutely not. The choice

between a good hash function and a bad one matters a great deal in

practice. Technically, any function that maps every possible key value to a slot in

the hash table is a hash function. In the extreme case, even a function that maps

all records to the same slot is a hash function, but it does nothing to help us find

records during a search operation.
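To make the extreme case concrete, here is a small sketch that fills a table to half capacity with two different hash functions and counts how many insertions find their home slot already occupied. The table size, the sample keys, and all names are assumptions chosen for illustration. The modulus function is simply a common alternative, not one the text prescribes.

```java
import java.util.function.IntUnaryOperator;

final class DegenerateHashDemo {
    static final int M = 100; // table size, chosen only for illustration

    // Legal but useless: every key maps to slot 0.
    static final IntUnaryOperator ALL_TO_ONE_SLOT = key -> 0;
    // A simple alternative: key modulo the table size.
    static final IntUnaryOperator MODULUS = key -> Math.floorMod(key, M);

    /** Counts insertions whose home slot was already occupied. */
    static int countCollisions(int[] keys, IntUnaryOperator hash) {
        boolean[] occupied = new boolean[M];
        int collisions = 0;
        for (int key : keys) {
            int slot = hash.applyAsInt(key);
            if (occupied[slot]) {
                collisions++;
            } else {
                occupied[slot] = true;
            }
        }
        return collisions;
    }

    /** Hypothetical keys: M/2 values, enough to make the table half full. */
    static int[] sampleKeys() {
        int[] keys = new int[M / 2];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = 1000 + 37 * i;
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = sampleKeys();
        System.out.println("all-to-one-slot: " + countCollisions(keys, ALL_TO_ONE_SLOT));
        System.out.println("modulus:         " + countCollisions(keys, MODULUS));
    }
}
```

With the constant function, every record after the first collides. These particular sample keys happen to spread perfectly under the modulus; real keys rarely behave so well.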

We would like to pick a hash function that stores the actual records in the collection

such that each slot in the hash table has equal probability of being filled.

Unfortunately, we normally have no control over the key values of the actual records,

so how well any particular hash function does this depends on the distribution of

the keys within the allowable key range. In some cases, incoming data are well

distributed across their key range. For example, if the input is a set of random

numbers selected uniformly from the key range, any hash function that assigns the

key range so that each slot in the hash table receives an equal share of the range

will likely also distribute the input records uniformly within the table. However,

in many applications the incoming records are highly clustered or otherwise poorly

distributed. When input records are not well distributed throughout the key range,

it can be difficult to devise a hash function that does a good job of distributing the

records throughout the table, especially if the input distribution is not known in

advance.
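The contrast between well-distributed and clustered input can be sketched as follows. The hash function gives each slot an equal share of the key range, as described above. The key range, table size, and both key sets are assumptions for illustration. Evenly spaced keys stand in for uniformly random ones so that the result is reproducible.

```java
import java.util.Arrays;

final class RangePartitionHash {
    static final int RANGE = 1000; // keys are assumed to lie in [0, RANGE)
    static final int M = 10;       // table size, chosen only for illustration

    /** Gives each slot an equal-width share of the key range. */
    static int hash(int key) {
        return (int) ((long) key * M / RANGE);
    }

    /** Number of keys whose home position is each slot. */
    static int[] slotCounts(int[] keys) {
        int[] counts = new int[M];
        for (int key : keys) {
            counts[hash(key)]++;
        }
        return counts;
    }

    /** 100 keys spread across the whole range: 0, 10, 20, ..., 990. */
    static int[] evenlySpreadKeys() {
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i * 10;
        }
        return keys;
    }

    /** 100 keys packed into the bottom tenth of the range: 0, 1, ..., 99. */
    static int[] clusteredKeys() {
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i;
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println("spread:    " + Arrays.toString(slotCounts(evenlySpreadKeys())));
        System.out.println("clustered: " + Arrays.toString(slotCounts(clusteredKeys())));
    }
}
```

The same function that balances the spread keys sends every clustered key to slot 0, which is why the input distribution matters as much as the function itself.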

There are many reasons why data values might be poorly distributed.

1. Natural frequency distributions tend to follow a common pattern where a few

of the entities occur frequently while most entities occur relatively rarely.

For example, consider the populations of the 100 largest cities in the United

States. If you plot these populations on a number line, most of them will be

clustered toward the low side, with a few outliers on the high side. This is an

example of a Zipf distribution (see Section 9.2). Viewed the other way, the

home town for a given person is far more likely to be a particular large city

than a particular small town.

2. Collected data are likely to be skewed in some way. Field samples might be

rounded to, say, the nearest 5 (i.e., all numbers end in 5 or 0).

3. If the input is a collection of common English words, the beginning letter

will be poorly distributed.
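Items 2 and 3 above can be illustrated with a short sketch. The two hash functions (value modulo 10, and first letter of a lowercase word), the table sizes, and the tiny word sample are all hypothetical choices made for this example, not taken from the text.

```java
import java.util.Set;
import java.util.TreeSet;

final class SkewedKeysDemo {
    /** Rounds a non-negative reading to the nearest multiple of 5. */
    static int roundToNearest5(int raw) {
        return ((raw + 2) / 5) * 5;
    }

    /** Slots hit when rounded readings 0..99 are hashed by value mod 10. */
    static Set<Integer> slotsForRoundedSamples() {
        Set<Integer> slots = new TreeSet<>();
        for (int raw = 0; raw < 100; raw++) {
            slots.add(roundToNearest5(raw) % 10);
        }
        return slots;
    }

    /** Hashes a non-empty lowercase ASCII word by its first letter: 'a' is 0, 'z' is 25. */
    static int firstLetterSlot(String word) {
        return word.charAt(0) - 'a';
    }

    /** Slots hit by a list of words in a 26-slot table. */
    static Set<Integer> slotsForWords(String[] words) {
        Set<Integer> slots = new TreeSet<>();
        for (String word : words) {
            slots.add(firstLetterSlot(word));
        }
        return slots;
    }

    // Hypothetical sample of common English words.
    static final String[] WORDS = {
        "the", "this", "that", "then", "to", "and", "a", "of", "in", "is"
    };

    public static void main(String[] args) {
        System.out.println("rounded samples use slots: " + slotsForRoundedSamples());
        System.out.println("common words use slots:    " + slotsForWords(WORDS));
    }
}
```

In this sketch the rounded samples reach only 2 of the 10 slots, and the ten sample words reach only 4 of the 26, with half of the words sharing the slot for the letter t.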
