Data Mining - Mining of Massive Datasets

Databases Reference

In-Depth Information

log 2 (2 20 /2 10 ) = log 2(2 10 ) = 10. Consider a document j in which w appears 20

times, and that is the maximum number of times in which any word appears

(perhaps after eliminating stop words). Then TF wj = 1, and the TF.IDF score

for w in document j is 10.

Suppose that in document k, word w appears once, while the maximum

number of occurrences of any word in this document is 20. Then TF wk = 1/20,

and the TF.IDF score for w in document k is 1/2.

2

1.3.2

Hash Functions

The reader has probably heard of hash tables, and perhaps used them in Java

classes or similar packages. The hash functions that make hash tables feasible

are also essential components in a number of data-mining algorithms, where

the hash table takes an unfamiliar form. We shall review the basics here.

First, a hash function h takes a hash-key value as an argument and produces

a bucket number as a result. The bucket number is an integer, normally in the

range 0 to B−1, where B is the number of buckets. Hash-keys can be of any

type. There is an intuitive property of hash functions that they “randomize”

hash-keys. To be precise, if hash-keys are drawn randomly from a reasonable

population of possible hash-keys, then h will send approximately equal numbers

of hash-keys to each of the B buckets. It would be impossible to do so if, for

example, the population of possible hash-keys were smaller than B. Such a

population would not be “reasonable.” However, there can be more subtle rea-

sons why a hash function fails to achieve an approximately uniform distribution

into buckets.

Example 1.4 : Suppose hash-keys are positive integers. A common and simple

hash function is to pick h(x) = x mod B, that is, the remainder when x is

divided by B. That choice works fine if our population of hash-keys is all

positive integers. 1/Bth of the integers will be assigned to each of the buckets.

However, suppose our population is the even integers, and B = 10. Then only

buckets 0, 2, 4, 6, and 8 can be the value of h(x), and the hash function is

distinctly nonrandom in its behavior. On the other hand, if we picked B = 11,

then we would find that 1/11th of the even integers get sent to each of the 11

buckets, so the hash function would work very well.

2

The generalization of Example 1.4 is that when hash-keys are integers, chos-

ing B so it has any common factor with all (or even most of) the possible hash-

keys will result in nonrandom distribution into buckets. Thus, it is normally

preferred that we choose B to be a prime. That choice reduces the chance of

nonrandom behavior, although we still have to consider the possibility that all

hash-keys have B as a factor. Of course there are many other types of hash

functions not based on modular arithmetic. We shall not try to summarize

the options here, but some sources of information will be mentioned in the

bibliographic notes.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home