Database Reference
In-Depth Information
The IDF for a term is defined as follows. Suppose term i appears in n i of the N documents
in the collection. Then IDF i = log 2 ( N / n i ). The TF.IDF score for term i in document j is then
defined to be TF ij × IDF i . The terms with the highest TF.IDF score are often the terms that
best characterize the topic of the document.
EXAMPLE 1.3 Suppose our repository consists of 2 20 = 1,048,576 documents. Suppose
word w appears in 2 10 = 1024 of these documents. Then IDF w = log 2 (2 20 /2 10 ) = log 2(2 10 )
= 10. Consider a document j in which w appears 20 times, and that is the maximum number
of times in which any word appears (perhaps after eliminating stop words). Then TF wj = 1,
and the TF.IDF score for w in document j is 10.
Suppose that in document k , word w appears once, while the maximum number of oc-
currences of any word in this document is 20. Then TF wk = 1/20, and the TF.IDF score for
w in document k is 1/2.
1.3.2
Hash Functions
The reader has probably heard of hash tables, and perhaps used them in Java classes or
similar packages. The hash functions that make hash tables feasible are also essential com-
ponents in a number of data-mining algorithms, where the hash table takes an unfamiliar
form. We shall review the basics here.
First, a hash function h takes a hash-key value as an argument and produces a bucket
number as a result. The bucket number is an integer, normally in the range 0 to B −1, where
B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of
hash functions that they “randomize” hash-keys. To be precise, if hash-keys are drawn ran-
domly from a reasonable population of possible hash-keys, then h will send approximately
equal numbers of hash-keys to each of the B buckets. It would be impossible to do so if,
for example, the population of possible hash-keys were smaller than B . Such a population
would not be “reasonable.” However, there can be more subtle reasons why a hash function
fails to achieve an approximately uniform distribution into buckets.
EXAMPLE 1.4 Suppose hash-keys are positive integers. A common and simple hash func-
tion is to pick h ( x ) = x mod B , that is, the remainder when x is divided by B . That choice
works fine if our population of hash-keys is all positive integers. 1/ B th of the integers will
be assigned to each of the buckets. However, suppose our population is the even integers,
and B = 10. Then only buckets 0, 2, 4, 6, and 8 can be the value of h ( x ), and the hash func-
tion is distinctly nonrandom in its behavior. On the other hand, if we picked B = 11, then
we would find that 1/11th of the even integers get sent to each of the 11 buckets, so the
hash function would work very well.
Search WWH ::




Custom Search