Link Analysis - Mining of Massive Datasets

Database Reference

In-Depth Information

5.3.4

Inferring Topics from Words

The question of classifying documents by topic is a subject that has been studied for dec-

ades, and we shall not go into great detail here. Suffice it to say that topics are characterized

by words that appear surprisingly often in documents on that topic. For example, neither

fullback nor measles appear very often in documents on the Web. But fullback

will appear far more often than average in pages about sports, and measles will appear

far more often than average in pages about medicine.

If we examine the entire Web, or a large, random sample of the Web, we can get the

background frequency of each word. Suppose we then go to a large sample of pages known

to be about a certain topic, say the pages classified under sports by the Open Directory.

Examine the frequencies of words in the sports sample, and identify the words that appear

significantly more frequently in the sports sample than in the background. In making this

judgment, we must be careful to avoid some extremely rare word that appears in the sports

sample with relatively higher frequency. This word is probably a misspelling that happened

to appear only in one or a few of the sports pages. Thus, we probably want to put a floor on

the number of times a word appears, before it can be considered characteristic of a topic.

Once we have identified a large collection of words that appear much more frequently in

the sports sample than in the background, and we do the same for all the topics on our list,

we can examine other pages and classify them by topic. Here is a simple approach. Sup-

pose that S 1 , S 2 , . . . , S k are the sets of words that have been determined to be characteristic

of each of the topics on our list. Let P be the set of words that appear in a given page P .

Compute the Jaccard similarity (recall Section 3.1.1 ) between P and each of the S i . Classify

the page as that topic with the highest Jaccard similarity. Note that all Jaccard similarities

may be very low, especially if the sizes of the sets S i are small. Thus, it is important to pick

reasonably large sets S i to make sure that we cover all aspects of the topic represented by

the set.

We can use this method, or a number of variants, to classify the pages the user has most

recently retrieved. We could say the user is interested in the topic into which the largest

number of these pages fall. Or we could blend the topic-sensitive PageRank vectors in pro-

portion to the fraction of these pages that fall into each topic, thus constructing a single

PageRank vector that reflects the user's current blend of interests. We could also use the

same procedure on the pages that the user currently has bookmarked, or combine the book-

marked pages with the recently viewed pages.

Search WWH ::

Custom Search

Home