Data Mining - Mining of Massive Datasets

Databases Reference

In-Depth Information

What if hash-keys are not integers? In a sense, all data types have values

that are composed of bits, and sequences of bits can always be interpreted as in-

tegers. However, there are some simple rules that enable us to convert common

types to integers. For example, if hash-keys are strings, convert each character

to its ASCII or Unicode equivalent, which can be interpreted as a small inte-

ger. Sum the integers before dividing by B. As long as B is smaller than the

typical sum of character codes for the population of strings, the distribution

into buckets will be relatively uniform. If B is larger, then we can partition the

characters of a string into groups of several characters each. Treat the concate-

nation of the codes for the characters of a group as a single integer. Sum the

integers associated with all the groups of a string, and divide by B as before.

For instance, if B is around a billion, or 2 30 , then grouping characters four at

a time will give us 32-bit integers. The sum of several of these will distribute

fairly evenly into a billion buckets.

For more complex data types, we can extend the idea used for converting

strings to integers, recursively.

•For a type that is a record, each of whose components has its own type,

recursively convert the value of each component to an integer, using the

algorithm appropriate for the type of that component. Sum the integers

for the components, and convert the integer sum to buckets by dividing

by B.

•For a type that is an array, set, or bag of elements of some one type,

convert the values of the elements' type to integers, sum the integers, and

divide by B.

1.3.3

Indexes

An index is a data structure that makes it e cient to retrieve objects given the

value of one or more elements of those objects. The most common situation

is one where the objects are records, and the index is on one of the fields

of that record. Given a value v for that field, the index lets us retrieve all

the records with value v in that field. For example, we could have a file of

(name, address, phone) triples, and an index on the phone field. Given a phone

number, the index allows us to find quickly the record or records with that

phone number.

There are many ways to implement indexes, and we shall not attempt to

survey the matter here. The bibliographic notes give suggestions for further

reading. However, a hash table is one simple way to build an index. The field

or fields on which the index is based form the hash-key for a hash function.

Records have the hash function applied to value of the hash-key, and the record

itself is placed in the bucket whose number is determined by the hash function.

The bucket could be a list of records in main-memory, or a disk block, for

example.

Search WWH ::

Custom Search

Home