1.3.1 Importance of Words in Documents
In several applications of data mining, we shall be faced with the problem of categorizing
documents (sequences of words) by their topic. Typically, topics are identified by finding
the special words that characterize documents about that topic. For instance, articles about
baseball would tend to have many occurrences of words like “ball,” “bat,” “pitch,” “run,”
and so on. Once we have classified documents to determine they are about baseball, it is
not hard to notice that words such as these appear unusually frequently. However, until we
have made the classification, it is not possible to identify these words as characteristic.
Thus, classification often starts by looking at documents and finding the significant words in those documents. Our first guess might be that the words appearing most frequently in a document are the most significant. However, that intuition is exactly opposite of the truth. The most frequent words will surely be common words such as “the” or “and,” which help build ideas but do not carry any significance themselves. In fact, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them.
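The stop-word removal step described above can be sketched as follows. The stop list here is a small hypothetical sample for illustration; in practice the list runs to several hundred words.

```python
# A small hypothetical sample of English stop words; a real list
# would contain several hundred of the most common words.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def remove_stop_words(terms):
    """Drop stop words from a list of terms before classification."""
    return [t for t in terms if t.lower() not in STOP_WORDS]
```

For example, applying it to the tokens of “the ball and the bat” keeps only “ball” and “bat.”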
In fact, the indicators of the topic are relatively rare words. However, not all rare words
are equally useful as indicators. There are certain words, for example “notwithstanding” or
“albeit,” that appear rarely in a collection of documents, yet do not tell us anything useful. On the other hand, a word like “chukker” is probably equally rare, but tips us off that
the document is about the sport of polo. The difference between rare words that tell us
something and those that do not has to do with the concentration of the useful words in just
a few documents. That is, the presence of a word like “albeit” in a document does not make
it terribly more likely that it will appear multiple times. However, if an article mentions
“chukker” once, it is likely to tell us what happened in the “first chukker,” then the “second
chukker,” and so on. That is, the word is likely to be repeated if it appears at all.
The formal measure of how concentrated into relatively few documents the occurrences of a given word are is called TF.IDF (Term Frequency times Inverse Document Frequency). It is normally computed as follows. Suppose we have a collection of N documents. Define f_ij to be the frequency (number of occurrences) of term (word) i in document j. Then, define the term frequency TF_ij to be:

    TF_ij = f_ij / max_k f_kj

That is, the term frequency of term i in document j is f_ij normalized by dividing it by the maximum number of occurrences of any term (perhaps excluding stop words) in the same document. Thus, the most frequent term in document j gets a TF of 1, and other terms get fractions as their term frequency for this document.
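The normalized term-frequency computation just defined can be sketched directly; the function name and the example tokens below are mine, not the book's.

```python
from collections import Counter

def term_frequencies(doc_terms):
    """Compute TF_ij = f_ij / max_k f_kj for one document.

    doc_terms is the list of terms in document j (stop words may
    be removed beforehand, as the text suggests).  The raw count of
    each term is divided by the count of the most frequent term, so
    the most frequent term gets a TF of exactly 1.
    """
    counts = Counter(doc_terms)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}
```

For a document whose tokens are ["chukker", "polo", "chukker", "match"], the term “chukker” (count 2, the maximum) gets TF 1.0, while “polo” and “match” each get TF 0.5.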