Databases Reference
In-Depth Information
log 2 (2 20 /2 10 ) = log 2(2 10 ) = 10. Consider a document j in which w appears 20
times, and that is the maximum number of times in which any word appears
(perhaps after eliminating stop words). Then TF wj = 1, and the TF.IDF score
for w in document j is 10.
Suppose that in document k, word w appears once, while the maximum
number of occurrences of any word in this document is 20. Then TF wk = 1/20,
and the TF.IDF score for w in document k is 1/2.
2
1.3.2
Hash Functions
The reader has probably heard of hash tables, and perhaps used them in Java
classes or similar packages. The hash functions that make hash tables feasible
are also essential components in a number of data-mining algorithms, where
the hash table takes an unfamiliar form. We shall review the basics here.
First, a hash function h takes a hash-key value as an argument and produces
a bucket number as a result. The bucket number is an integer, normally in the
range 0 to B−1, where B is the number of buckets. Hash-keys can be of any
type. There is an intuitive property of hash functions that they “randomize”
hash-keys. To be precise, if hash-keys are drawn randomly from a reasonable
population of possible hash-keys, then h will send approximately equal numbers
of hash-keys to each of the B buckets. It would be impossible to do so if, for
example, the population of possible hash-keys were smaller than B. Such a
population would not be “reasonable.” However, there can be more subtle rea-
sons why a hash function fails to achieve an approximately uniform distribution
into buckets.
Example 1.4 : Suppose hash-keys are positive integers. A common and simple
hash function is to pick h(x) = x mod B, that is, the remainder when x is
divided by B. That choice works fine if our population of hash-keys is all
positive integers. 1/Bth of the integers will be assigned to each of the buckets.
However, suppose our population is the even integers, and B = 10. Then only
buckets 0, 2, 4, 6, and 8 can be the value of h(x), and the hash function is
distinctly nonrandom in its behavior. On the other hand, if we picked B = 11,
then we would find that 1/11th of the even integers get sent to each of the 11
buckets, so the hash function would work very well.
2
The generalization of Example 1.4 is that when hash-keys are integers, chos-
ing B so it has any common factor with all (or even most of) the possible hash-
keys will result in nonrandom distribution into buckets. Thus, it is normally
preferred that we choose B to be a prime. That choice reduces the chance of
nonrandom behavior, although we still have to consider the possibility that all
hash-keys have B as a factor. Of course there are many other types of hash
functions not based on modular arithmetic. We shall not try to summarize
the options here, but some sources of information will be mentioned in the
bibliographic notes.
Search WWH ::




Custom Search