Java Reference
In-Depth Information
birthday is similar to assigning records to slots in a table (of size 365) using the
birthday as a hash function. Note that this observation tells us nothing about which
students share a birthday, or on which days of the year shared birthdays fall.
To be practical, a database organized by hashing must store records in a hash
table that is not so large that it wastes space. Typically, this means that the hash
table will be around half full. Because collisions are extremely likely to occur
under these conditions (by chance, any record inserted into a table that is half full
will have a collision half of the time), does this mean that we need not worry about
the ability of a hash function to avoid collisions? Absolutely not. The difference
between a good hash function and a bad hash function makes a big difference in
practice. Technically, any function that maps all possible key values to a slot in
the hash table is a hash function. In the extreme case, even a function that maps
all records to the same slot is a hash function, but it does nothing to help us find
records during a search operation.
We would like to pick a hash function that stores the actual records in the col-
lection such that each slot in the hash table has equal probability of being filled. Un-
fortunately, we normally have no control over the key values of the actual records,
so how well any particular hash function does this depends on the distribution of
the keys within the allowable key range. In some cases, incoming data are well
distributed across their key range. For example, if the input is a set of random
numbers selected uniformly from the key range, any hash function that assigns the
key range so that each slot in the hash table receives an equal share of the range
will likely also distribute the input records uniformly within the table. However,
in many applications the incoming records are highly clustered or otherwise poorly
distributed. When input records are not well distributed throughout the key range
it can be difficult to devise a hash function that does a good job of distributing the
records throughout the table, especially if the input distribution is not known in
advance.
There are many reasons why data values might be poorly distributed.
1. Natural frequency distributions tend to follow a common pattern where a few
of the entities occur frequently while most entities occur relatively rarely.
For example, consider the populations of the 100 largest cities in the United
States. If you plot these populations on a number line, most of them will be
clustered toward the low side, with a few outliers on the high side. This is an
example of a Zipf distribution (see Section 9.2). Viewed the other way, the
home town for a given person is far more likely to be a particular large city
than a particular small town.
2. Collected data are likely to be skewed in some way. Field samples might be
rounded to, say, the nearest 5 (i.e., all numbers end in 5 or 0).
3. If the input is a collection of common English words, the beginning letter
will be poorly distributed.
Search WWH ::




Custom Search