Data Mining - Mining of Massive Datasets

Database Reference

In-Depth Information

The IDF for a term is defined as follows. Suppose term i appears in n i of the N documents

in the collection. Then IDF i = log 2 ( N / n i ). The TF.IDF score for term i in document j is then

defined to be TF ij × IDF i . The terms with the highest TF.IDF score are often the terms that

best characterize the topic of the document.

EXAMPLE 1.3 Suppose our repository consists of 2 20 = 1,048,576 documents. Suppose

word w appears in 2 10 = 1024 of these documents. Then IDF w = log 2 (2 20 /2 10 ) = log 2(2 10 )

= 10. Consider a document j in which w appears 20 times, and that is the maximum number

of times in which any word appears (perhaps after eliminating stop words). Then TF wj = 1,

and the TF.IDF score for w in document j is 10.

Suppose that in document k , word w appears once, while the maximum number of oc-

currences of any word in this document is 20. Then TF wk = 1/20, and the TF.IDF score for

w in document k is 1/2.

□

1.3.2

Hash Functions

The reader has probably heard of hash tables, and perhaps used them in Java classes or

similar packages. The hash functions that make hash tables feasible are also essential com-

ponents in a number of data-mining algorithms, where the hash table takes an unfamiliar

form. We shall review the basics here.

First, a hash function h takes a hash-key value as an argument and produces a bucket

number as a result. The bucket number is an integer, normally in the range 0 to B −1, where

B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of

hash functions that they “randomize” hash-keys. To be precise, if hash-keys are drawn ran-

domly from a reasonable population of possible hash-keys, then h will send approximately

equal numbers of hash-keys to each of the B buckets. It would be impossible to do so if,

for example, the population of possible hash-keys were smaller than B . Such a population

would not be “reasonable.” However, there can be more subtle reasons why a hash function

fails to achieve an approximately uniform distribution into buckets.

EXAMPLE 1.4 Suppose hash-keys are positive integers. A common and simple hash func-

tion is to pick h ( x ) = x mod B , that is, the remainder when x is divided by B . That choice

works fine if our population of hash-keys is all positive integers. 1/ B th of the integers will

be assigned to each of the buckets. However, suppose our population is the even integers,

and B = 10. Then only buckets 0, 2, 4, 6, and 8 can be the value of h ( x ), and the hash func-

tion is distinctly nonrandom in its behavior. On the other hand, if we picked B = 11, then

we would find that 1/11th of the even integers get sent to each of the 11 buckets, so the

hash function would work very well.

□

Search WWH ::

Custom Search

Home