have classified documents to determine they are about baseball, it is not hard
to notice that words such as these appear unusually frequently. However, until
we have made the classification, it is not possible to identify these words as
characteristic.
Thus, classification often starts by looking at documents, and finding the
significant words in those documents. Our first guess might be that the words
appearing most frequently in a document are the most significant. However,
that intuition is exactly opposite of the truth. The most frequent words will
most surely be the common words such as “the” or “and,” which help build
ideas but do not carry any significance themselves. In fact, the several hundred
most common words in English (called stop words) are often removed from
documents before any attempt to classify them.
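Stop-word removal can be sketched in a few lines of Python. The list below is a tiny illustrative sample of my own choosing; real systems use lists of several hundred words.

```python
# A tiny sample of English stop words (real lists contain several hundred).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def remove_stop_words(text):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

words = remove_stop_words("The batter hit the ball and ran to first base")
# "the", "and", and "to" are gone; the content-bearing words remain.
```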
Rather, the indicators of the topic are relatively rare words. However, not
all rare words are equally useful as indicators. There are certain words, for
example “notwithstanding” or “albeit,” that appear rarely in a collection of
documents, yet do not tell us anything useful. On the other hand, a word like
“chukker” is probably equally rare, but tips us off that the document is about
the sport of polo. The difference between rare words that tell us something and
those that do not has to do with the concentration of the useful words in just a
few documents. That is, the presence of a word like “albeit” in a document does
not make it terribly more likely that it will appear multiple times. However,
if an article mentions “chukker” once, it is likely to tell us what happened in
the “first chukker,” then the “second chukker,” and so on. That is, the word is
likely to be repeated if it appears at all.
The formal measure of how concentrated the occurrences of a given word are
in relatively few documents is called TF.IDF (Term Frequency times Inverse
Document Frequency). It is normally computed as follows. Suppose we have a
collection of N documents. Define f_ij to be the frequency (number of
occurrences) of term (word) i in document j. Then, define the term frequency
TF_ij to be:

    TF_ij = f_ij / max_k f_kj
That is, the term frequency of term i in document j is f_ij normalized by dividing
it by the maximum number of occurrences of any term (perhaps excluding stop
words) in the same document. Thus, the most frequent term in document j
gets a TF of 1, and other terms get fractions as their term frequency for this
document.
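The normalization just described can be sketched as follows (function and variable names are my own):

```python
from collections import Counter

def term_frequencies(words):
    """Compute TF_ij = f_ij / max_k f_kj for each term i in one document j."""
    counts = Counter(words)            # f_ij for every term i
    max_count = max(counts.values())   # max_k f_kj
    return {term: c / max_count for term, c in counts.items()}

tf = term_frequencies(["chukker", "polo", "chukker", "pony"])
# "chukker" is the most frequent term, so its TF is 1; the others get 0.5.
```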
The IDF for a term is defined as follows. Suppose term i appears in n_i
of the N documents in the collection. Then IDF_i = log_2(N/n_i). The TF.IDF
score for term i in document j is then defined to be TF_ij × IDF_i. The terms
with the highest TF.IDF score are often the terms that best characterize the
topic of the document.
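Putting the two definitions together, a complete TF.IDF computation over a small collection might look like the sketch below. The documents and names are invented for illustration; each document is simply a list of words.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute TF.IDF for every term in every document.

    docs: a list of documents, each given as a list of words.
    Returns one dict per document, mapping term -> TF.IDF score.
    """
    N = len(docs)
    # n_i: the number of documents in which term i appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)                 # f_ij
        max_count = max(counts.values())      # max_k f_kj
        scores.append({
            t: (c / max_count) * math.log2(N / doc_freq[t])
            for t, c in counts.items()
        })
    return scores

docs = [["chukker", "polo", "chukker"], ["the", "game"], ["the", "polo"]]
scores = tfidf_scores(docs)
# "chukker" appears in only 1 of 3 documents, so its IDF is log2(3);
# "the" appears in 2 of 3, so its IDF is only log2(3/2).
```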
Example 1.3 : Suppose our repository consists of 2^20 = 1,048,576 documents.
Suppose word w appears in 2^10 = 1024 of these documents. Then
IDF_w = log_2(2^20 / 2^10) = log_2(2^10) = 10.
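The arithmetic of this example can be checked directly (a quick sketch in Python):

```python
import math

N = 2 ** 20           # 1,048,576 documents in the repository
n_w = 2 ** 10         # word w appears in 1024 of them
idf_w = math.log2(N / n_w)
# log2(2^20 / 2^10) = log2(2^10) = 10.0
```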