Data Mining - Mining of Massive Datasets

Database Reference

In-Depth Information

The generalization of Example 1.4 is that when hash-keys are integers, chosing B so it

has any common factor with all (or even most of) the possible hash-keys will result in non-

random distribution into buckets. Thus, it is normally preferred that we choose B to be a

prime. That choice reduces the chance of nonrandom behavior, although we still have to

consider the possibility that all hash-keys have B as a factor. Of course there are many other

types of hash functions not based on modular arithmetic. We shall not try to summarize the

options here, but some sources of information will be mentioned in the bibliographic notes.

What if hash-keys are not integers? In a sense, all data types have values that are com-

posed of bits, and sequences of bits can always be interpreted as integers. However, there

are some simple rules that enable us to convert common types to integers. For example,

if hash-keys are strings, convert each character to its ASCII or Unicode equivalent, which

can be interpreted as a small integer. Sum the integers before dividing by B . As long as

B is smaller than the typical sum of character codes for the population of strings, the dis-

tribution into buckets will be relatively uniform. If B is larger, then we can partition the

characters of a string into groups of several characters each. Treat the concatenation of the

codes for the characters of a group as a single integer. Sum the integers associated with all

the groups of a string, and divide by B as before. For instance, if B is around a billion, or

2 30 , then grouping characters four at a time will give us 32-bit integers. The sum of several

of these will distribute fairly evenly into a billion buckets.

For more complex data types, we can extend the idea used for converting strings to in-

tegers, recursively.

• For a type that is a record, each of whose components has its own type, recursively

convert the value of each component to an integer, using the algorithm appropriate

for the type of that component. Sum the integers for the components, and convert

the integer sum to buckets by dividing by B .

• For a type that is an array, set, or bag of elements of some one type, convert the

values of the elements' type to integers, sum the integers, and divide by B .

1.3.3

Indexes

An index is a data structure that makes it efficient to retrieve objects given the value of one

or more elements of those objects. The most common situation is one where the objects

are records, and the index is on one of the fields of that record. Given a value v for that

field, the index lets us retrieve all the records with value v in that field. For example, we

could have a file of (name, address, phone) triples, and an index on the phone field. Given

a phone number, the index allows us to find quickly the record or records with that phone

number.

Search WWH ::

Custom Search

Home