Database Reference
In-Depth Information
5.3.4
Inferring Topics from Words
The question of classifying documents by topic is a subject that has been studied for dec-
ades, and we shall not go into great detail here. Suffice it to say that topics are characterized
by words that appear surprisingly often in documents on that topic. For example, neither
fullback nor measles appear very often in documents on the Web. But fullback
will appear far more often than average in pages about sports, and measles will appear
far more often than average in pages about medicine.
If we examine the entire Web, or a large, random sample of the Web, we can get the
background frequency of each word. Suppose we then go to a large sample of pages known
to be about a certain topic, say the pages classified under sports by the Open Directory.
Examine the frequencies of words in the sports sample, and identify the words that appear
significantly more frequently in the sports sample than in the background. In making this
judgment, we must be careful to avoid some extremely rare word that appears in the sports
sample with relatively higher frequency. This word is probably a misspelling that happened
to appear only in one or a few of the sports pages. Thus, we probably want to put a floor on
the number of times a word appears, before it can be considered characteristic of a topic.
Once we have identified a large collection of words that appear much more frequently in
the sports sample than in the background, and we do the same for all the topics on our list,
we can examine other pages and classify them by topic. Here is a simple approach. Sup-
pose that S 1 , S 2 , . . . , S k are the sets of words that have been determined to be characteristic
of each of the topics on our list. Let P be the set of words that appear in a given page P .
Compute the Jaccard similarity (recall Section 3.1.1 ) between P and each of the S i . Classify
the page as that topic with the highest Jaccard similarity. Note that all Jaccard similarities
may be very low, especially if the sizes of the sets S i are small. Thus, it is important to pick
reasonably large sets S i to make sure that we cover all aspects of the topic represented by
the set.
We can use this method, or a number of variants, to classify the pages the user has most
recently retrieved. We could say the user is interested in the topic into which the largest
number of these pages fall. Or we could blend the topic-sensitive PageRank vectors in pro-
portion to the fraction of these pages that fall into each topic, thus constructing a single
PageRank vector that reflects the user's current blend of interests. We could also use the
same procedure on the pages that the user currently has bookmarked, or combine the book-
marked pages with the recently viewed pages.
Search WWH ::




Custom Search