have classified documents to determine they are about baseball, it is not hard
to notice that words such as these appear unusually frequently. However, until
we have made the classification, it is not possible to identify these words as
characteristic.
Thus, classification often starts by looking at documents, and finding the
significant words in those documents. Our first guess might be that the words
appearing most frequently in a document are the most significant. However,
that intuition is exactly opposite of the truth. The most frequent words will
most surely be the common words such as “the” or “and,” which help build
ideas but do not carry any significance themselves. In fact, the several hundred
most common words in English (called stop words) are often removed from
documents before any attempt to classify them.
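Stop-word removal can be sketched in a few lines of Python. The list below is a tiny illustrative sample of my own choosing; real systems use lists of several hundred words.

```python
# A tiny sample of English stop words (real lists contain several hundred).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def remove_stop_words(text):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

words = remove_stop_words("The batter hit the ball and ran to first base")
# "the", "and", and "to" are gone; the content-bearing words remain.
```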
Rather, the indicators of the topic are relatively rare words. However, not
all rare words are equally useful as indicators. There are certain words, for
example “notwithstanding” or “albeit,” that appear rarely in a collection of
documents, yet do not tell us anything useful. On the other hand, a word like
“chukker” is probably equally rare, but tips us off that the document is about
the sport of polo. The difference between rare words that tell us something and
those that do not has to do with the concentration of the useful words in just a
few documents. That is, the presence of a word like “albeit” in a document does
not make it terribly more likely that it will appear multiple times. However,
if an article mentions “chukker” once, it is likely to tell us what happened in
the “first chukker,” then the “second chukker,” and so on. That is, the word is
likely to be repeated if it appears at all.
The formal measure of how concentrated the occurrences of a given word are
in relatively few documents is called TF.IDF (Term Frequency times Inverse
Document Frequency). It is normally computed as follows. Suppose we have a
collection of N documents. Define f_ij to be the frequency (number of
occurrences) of term (word) i in document j. Then, define the term frequency
TF_ij to be:

    TF_ij = f_ij / max_k f_kj
That is, the term frequency of term i in document j is f_ij normalized by dividing
it by the maximum number of occurrences of any term (perhaps excluding stop
words) in the same document. Thus, the most frequent term in document j
gets a TF of 1, and other terms get fractions as their term frequency for this
document.
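The normalization just described can be sketched as follows (function and variable names are my own):

```python
from collections import Counter

def term_frequencies(words):
    """Compute TF_ij = f_ij / max_k f_kj for each term i in one document j."""
    counts = Counter(words)            # f_ij for every term i
    max_count = max(counts.values())   # max_k f_kj
    return {term: c / max_count for term, c in counts.items()}

tf = term_frequencies(["chukker", "polo", "chukker", "pony"])
# "chukker" is the most frequent term, so its TF is 1; the others get 0.5.
```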
The IDF for a term is defined as follows. Suppose term i appears in n_i
of the N documents in the collection. Then IDF_i = log_2(N/n_i). The TF.IDF
score for term i in document j is then defined to be TF_ij × IDF_i. The terms
with the highest TF.IDF score are often the terms that best characterize the
topic of the document.
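Putting the two definitions together, a complete TF.IDF computation over a small collection might look like the sketch below. The documents and names are invented for illustration; each document is simply a list of words.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute TF.IDF for every term in every document.

    docs: a list of documents, each given as a list of words.
    Returns one dict per document, mapping term -> TF.IDF score.
    """
    N = len(docs)
    # n_i: the number of documents in which term i appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)                 # f_ij
        max_count = max(counts.values())      # max_k f_kj
        scores.append({
            t: (c / max_count) * math.log2(N / doc_freq[t])
            for t, c in counts.items()
        })
    return scores

docs = [["chukker", "polo", "chukker"], ["the", "game"], ["the", "polo"]]
scores = tfidf_scores(docs)
# "chukker" appears in only 1 of 3 documents, so its IDF is log2(3);
# "the" appears in 2 of 3, so its IDF is only log2(3/2).
```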
Example 1.3 : Suppose our repository consists of 2^20 = 1,048,576 documents.
Suppose word w appears in 2^10 = 1024 of these documents. Then
IDF_w = log_2(2^20 / 2^10) = log_2(2^10) = 10.
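The arithmetic of this example can be checked directly (a quick sketch in Python):

```python
import math

N = 2 ** 20           # 1,048,576 documents in the repository
n_w = 2 ** 10         # word w appears in 1024 of them
idf_w = math.log2(N / n_w)
# log2(2^20 / 2^10) = log2(2^10) = 10.0
```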