This is inspired by the popular TFIDF term score of vocabulary terms used in
information retrieval (3). The first expression is akin to the term frequency;
the second expression is akin to the document frequency, down-weighting
terms that have high probability under all the topics. Other methods of
measuring how one topic differs from the others can be found in (34).
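To make the score concrete, the following is a minimal Python sketch, assuming the commonly used formulation $\text{term-score}_{k,v} = \beta_{k,v} \log\big(\beta_{k,v} / (\prod_{j=1}^{K} \beta_{j,v})^{1/K}\big)$, where $\beta_{k,v}$ is the probability of term $v$ under topic $k$; the function name and array layout are ours, not the chapter's.

```python
import numpy as np

def term_scores(beta):
    """TFIDF-style term scores for topics.

    beta: (K, V) array; beta[k, v] is the probability of term v
    under topic k.  High scores mark terms that are probable under
    one topic but improbable under the topics overall.
    """
    log_beta = np.log(beta)
    # The geometric mean of a term's probability across all K topics
    # plays the role of the document frequency: terms with high
    # probability under every topic are down-weighted.
    log_geo_mean = log_beta.mean(axis=0, keepdims=True)
    return beta * (log_beta - log_geo_mean)

# Top ten words for topic k:
# top_words = np.argsort(-term_scores(beta)[k])[:10]
```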
Visualizing a document. We use the posterior topic proportions $\theta_{d,k}$ and
posterior topic assignments $z_{d,n,k}$ to visualize the underlying topic
decomposition of a document. Plotting the posterior topic proportions gives a sense
of which topics the document is “about.” These vectors can also be used to
group articles that exhibit certain topics with high proportions. Note that, in
contrast to traditional clustering models (16), articles contain multiple topics
and thus can belong to multiple groups. Finally, examining the most likely
topic assigned to each word gives a sense of how the topics are divided up
within the document.
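As a concrete illustration, the sketch below summarizes one document from these two posterior quantities; the names `theta_d`, `z_d`, and `topic_labels` are hypothetical stand-ins for point estimates extracted from the approximate posterior.

```python
import numpy as np

def summarize_document(theta_d, z_d, topic_labels, top=3):
    """Summarize the topic decomposition of a single document.

    theta_d:      (K,) posterior topic proportions for the document.
    z_d:          (N,) most likely topic index for each of its N words.
    topic_labels: K human-readable topic names.
    """
    # Which topics the document is "about": highest proportions first.
    for k in np.argsort(-theta_d)[:top]:
        print(f"{topic_labels[k]}: theta = {theta_d[k]:.2f}")
    # How the topics divide up the words: per-word assignment counts.
    counts = np.bincount(z_d, minlength=len(topic_labels))
    for k in np.argsort(-counts)[:top]:
        print(f"{topic_labels[k]}: {counts[k]} words")
```

The same proportion vectors can also drive the grouping mentioned above, e.g. by collecting all documents whose $\theta_{d,k}$ for a given topic exceeds a cutoff.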
Finding similar documents. We can further use the posterior topic pro-
portions to define a topic-based similarity measure between documents. These
vectors provide a low-dimensional simplicial representation of each document,
reducing their representation from the $(V-1)$-simplex to the $(K-1)$-simplex.
One can use the Hellinger distance between documents as a similarity measure,
$$\text{document-similarity}_{d,f} \;=\; \sum_{k=1}^{K} \left( \sqrt{\theta_{d,k}} - \sqrt{\theta_{f,k}} \right)^{2}. \qquad (4.4)$$
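A direct translation of Eq. (4.4) into Python might look like the following sketch (the array name `theta` and the function name are ours); since this is a distance, smaller values indicate more similar documents.

```python
import numpy as np

def document_similarity(theta):
    """Pairwise document similarity from Eq. (4.4).

    theta: (D, K) array of posterior topic proportions.
    Returns a (D, D) array whose (d, f) entry is the squared
    Hellinger distance between documents d and f.
    """
    root = np.sqrt(theta)                       # (D, K)
    diff = root[:, None, :] - root[None, :, :]  # (D, D, K)
    return (diff ** 2).sum(axis=-1)

# Ten most similar articles to document d (excluding d itself):
# nearest = np.argsort(document_similarity(theta)[d])[1:11]
```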
To illustrate the above three notions, we examined an approximation to the
posterior distribution derived from the JSTOR archive of Science from
1980–2002. The corpus contains 21,434 documents comprising 16M words when we
use the 10,000 terms chosen by TFIDF (see Section 4.3.2). The model was
fixed to have 50 topics.
We illustrate the analysis of a single article in Figure 4.4. The figure depicts
the topic proportions, the top scoring words from the most prevalent topics,
the assignment of words to topics in the abstract of the article, and the top
ten most similar articles.
4.3 Posterior Inference for LDA
The central computational problem for topic modeling with LDA is ap-
proximating the posterior in Eq. (4.2). This distribution is the key to using
LDA for both quantitative tasks, such as prediction and document general-
ization, and the qualitative exploratory tasks that we discuss here. Several
approximation techniques have been developed for LDA, including mean field
variational inference and Gibbs sampling.
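To give a flavor of one of these techniques, here is a minimal sketch of a collapsed Gibbs sampler for LDA; the hyperparameters `alpha` and `eta` and all names are our own illustration, not the chapter's notation, and a practical implementation would add burn-in, convergence checks, and sparse data structures.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: a minimal sketch.

    docs: list of documents, each a list of term ids in [0, V).
    Returns (theta, beta): estimated topic proportions (D, K) and
    topic-term distributions (K, V) from the final sample's counts.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))   # topic counts within each document
    n_kv = np.zeros((K, V))   # term counts within each topic
    n_k = np.zeros(K)         # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):           # initialize the counts
        for n, v in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]                  # remove current assignment
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # p(z_{d,n} = k | all other assignments, words)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                  # add it back under new topic
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    beta = (n_kv + eta) / (n_kv.sum(1, keepdims=True) + V * eta)
    return theta, beta
```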