This is inspired by the popular TFIDF term score of vocabulary terms used in
information retrieval (3). The first expression is akin to the term frequency;
the second expression is akin to the document frequency, down-weighting
terms that have high probability under all the topics. Other methods of
measuring how one topic differs from the others can be found in (34).
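To make the score concrete, the following is a minimal Python sketch, assuming the commonly used formulation $\text{term-score}_{k,v} = \beta_{k,v} \log\big(\beta_{k,v} / (\prod_{j=1}^{K} \beta_{j,v})^{1/K}\big)$, where $\beta_{k,v}$ is the probability of term $v$ under topic $k$; the function name and array layout are ours, not the chapter's.

```python
import numpy as np

def term_scores(beta):
    """TFIDF-style term scores for topics.

    beta: (K, V) array; beta[k, v] is the probability of term v
    under topic k.  High scores mark terms that are probable under
    one topic but improbable under the topics overall.
    """
    log_beta = np.log(beta)
    # The geometric mean of a term's probability across all K topics
    # plays the role of the document frequency: terms with high
    # probability under every topic are down-weighted.
    log_geo_mean = log_beta.mean(axis=0, keepdims=True)
    return beta * (log_beta - log_geo_mean)

# Top ten words for topic k:
# top_words = np.argsort(-term_scores(beta)[k])[:10]
```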
Visualizing a document. We use the posterior topic proportions $\theta_{d,k}$ and
posterior topic assignments $z_{d,n,k}$ to visualize the underlying topic
decomposition of a document. Plotting the posterior topic proportions gives a sense
of which topics the document is “about.” These vectors can also be used to
group articles that exhibit certain topics with high proportions. Note that, in
contrast to traditional clustering models (16), articles contain multiple topics
and thus can belong to multiple groups. Finally, examining the most likely
topic assigned to each word gives a sense of how the topics are divided up
within the document.
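As a concrete illustration, the sketch below summarizes one document from these two posterior quantities; the names `theta_d`, `z_d`, and `topic_labels` are hypothetical stand-ins for point estimates extracted from the approximate posterior.

```python
import numpy as np

def summarize_document(theta_d, z_d, topic_labels, top=3):
    """Summarize the topic decomposition of a single document.

    theta_d:      (K,) posterior topic proportions for the document.
    z_d:          (N,) most likely topic index for each of its N words.
    topic_labels: K human-readable topic names.
    """
    # Which topics the document is "about": highest proportions first.
    for k in np.argsort(-theta_d)[:top]:
        print(f"{topic_labels[k]}: theta = {theta_d[k]:.2f}")
    # How the topics divide up the words: per-word assignment counts.
    counts = np.bincount(z_d, minlength=len(topic_labels))
    for k in np.argsort(-counts)[:top]:
        print(f"{topic_labels[k]}: {counts[k]} words")
```

The same proportion vectors can also drive the grouping mentioned above, e.g. by collecting all documents whose $\theta_{d,k}$ for a given topic exceeds a cutoff.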
Finding similar documents. We can further use the posterior topic pro-
portions to define a topic-based similarity measure between documents. These
vectors provide a low-dimensional simplicial representation of each document,
reducing their representation from the $(V-1)$-simplex to the $(K-1)$-simplex.
One can use the Hellinger distance between documents as a similarity measure,
$$\text{document-similarity}_{d,f} \;=\; \sum_{k=1}^{K} \left( \sqrt{\theta_{d,k}} - \sqrt{\theta_{f,k}} \right)^{2}. \qquad (4.4)$$
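A direct translation of Eq. (4.4) into Python might look like the following sketch (the array name `theta` and the function name are ours); since this is a distance, smaller values indicate more similar documents.

```python
import numpy as np

def document_similarity(theta):
    """Pairwise document similarity from Eq. (4.4).

    theta: (D, K) array of posterior topic proportions.
    Returns a (D, D) array whose (d, f) entry is the squared
    Hellinger distance between documents d and f.
    """
    root = np.sqrt(theta)                       # (D, K)
    diff = root[:, None, :] - root[None, :, :]  # (D, D, K)
    return (diff ** 2).sum(axis=-1)

# Ten most similar articles to document d (excluding d itself):
# nearest = np.argsort(document_similarity(theta)[d])[1:11]
```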
To illustrate the above three notions, we examined an approximation to the
posterior distribution derived from the JSTOR archive of Science from
1980–2002. The corpus contains 21,434 documents comprising 16M words when we
use the 10,000 terms chosen by TFIDF (see Section 4.3.2). The model was
fixed to have 50 topics.
We illustrate the analysis of a single article in Figure 4.4. The figure depicts
the topic proportions, the top scoring words from the most prevalent topics,
the assignment of words to topics in the abstract of the article, and the top
ten most similar articles.
4.3 Posterior Inference for LDA
The central computational problem for topic modeling with LDA is ap-
proximating the posterior in Eq. (4.2). This distribution is the key to using
LDA for both quantitative tasks, such as prediction and document general-
ization, and the qualitative exploratory tasks that we discuss here. Several
approximation techniques have been developed for LDA, including mean field
variational inference and Gibbs sampling.
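To give a flavor of one of these techniques, here is a minimal sketch of a collapsed Gibbs sampler for LDA; the hyperparameters `alpha` and `eta` and all names are our own illustration, not the chapter's notation, and a practical implementation would add burn-in, convergence checks, and sparse data structures.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: a minimal sketch.

    docs: list of documents, each a list of term ids in [0, V).
    Returns (theta, beta): estimated topic proportions (D, K) and
    topic-term distributions (K, V) from the final sample's counts.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))   # topic counts within each document
    n_kv = np.zeros((K, V))   # term counts within each topic
    n_k = np.zeros(K)         # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):           # initialize the counts
        for n, v in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]                  # remove current assignment
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # p(z_{d,n} = k | all other assignments, words)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                  # add it back under new topic
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    beta = (n_kv + eta) / (n_kv.sum(1, keepdims=True) + V * eta)
    return theta, beta
```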