When the end task is predictive, one can use cross validation on the error
of the task at hand (e.g., information retrieval, text classification). When
the goal is qualitative, such as corpus exploration, one can use cross
validation on predictive likelihood, essentially choosing the number of topics
that provides the best language model. An alternative is to take a
nonparametric Bayesian approach: hierarchical Dirichlet processes can be used
to develop a topic model in which the number of topics is automatically
selected and may grow as new data is observed (35).
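As a concrete illustration of the cross-validation approach, here is a minimal sketch in Python, assuming scikit-learn is available; the tiny corpus and the candidate topic counts are placeholders, not data from this chapter. It selects the number of topics by held-out perplexity, the quantity whose minimization corresponds to choosing the best language model.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

# Toy stand-in corpus; in practice these would be real documents.
docs = [
    "gene dna genetic heredity mutation",
    "dna sequencing genome gene expression",
    "disease health clinical patient treatment",
    "patient health disease therapy trial",
    "x-ray telescope astronomy stellar emission",
    "stellar x-ray source astronomy survey",
    "gene disease risk health genetics",
    "telescope survey emission galaxy x-ray",
]
X = CountVectorizer().fit_transform(docs)

best_k, best_perplexity = None, np.inf
for k in (2, 3, 5, 10):  # candidate numbers of topics
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=4, shuffle=True,
                                     random_state=0).split(docs):
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X[train_idx])
        # Held-out perplexity: lower means a better language model.
        fold_scores.append(lda.perplexity(X[test_idx]))
    mean_perplexity = float(np.mean(fold_scores))
    if mean_perplexity < best_perplexity:
        best_k, best_perplexity = k, mean_perplexity

print(f"selected number of topics: {best_k}")
```

When the goal is predictive rather than exploratory, the same loop can score a downstream task error (e.g., classification accuracy) in place of perplexity.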
4.4 Dynamic Topic Models and Correlated Topic Models
In this section, we will describe two extensions to LDA: the correlated topic
model and the dynamic topic model. Each embellishes LDA to relax one of
its implicit assumptions. In addition to describing topic models that are more
powerful than LDA, our goal is to give the reader a sense of the practice of
topic modeling. Deciding on an appropriate model of a corpus depends both on
what kind of structure is hidden in the data and on what kind of structure the
practitioner cares to examine. While LDA may be appropriate for learning a
fixed set of topics, other applications of topic modeling may call for discovering
the connections between topics or modeling topics as changing through time.
4.4.1 The Correlated Topic Model
One limitation of LDA is that it fails to directly model correlation between
the occurrence of topics. In many—indeed most—text corpora, it is natural
to expect that the occurrences of the underlying latent topics will be highly
correlated. In the Science corpus, for example, an article about genetics may
be likely to also be about health and disease, but unlikely to also be about
x-ray astronomy.
In LDA, this modeling limitation stems from the independence assump-
tions implicit in the Dirichlet distribution of the topic proportions. Specifi-
cally, under a Dirichlet, the components of the proportions vector are nearly
independent, which leads to the strong assumption that the presence of one
topic is not correlated with the presence of another. (We say “nearly
independent” because the constraint that the components sum to one induces a
slight negative correlation among them.)
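This near-independence can be checked empirically. The following minimal sketch (the dimension and concentration parameter are chosen arbitrarily) samples many proportion vectors from a Dirichlet with NumPy and prints their correlation matrix, whose off-diagonal entries are all slightly negative and never positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 topic-proportion vectors from a symmetric Dirichlet over 5 topics.
theta = rng.dirichlet(alpha=np.full(5, 0.5), size=100_000)

# Off-diagonal entries are slightly negative and never positive:
# the Dirichlet cannot express positive correlation between topics.
print(np.corrcoef(theta, rowvar=False).round(3))
```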
In the correlated topic model (CTM), we model the topic proportions with an
alternative, more flexible distribution, the logistic normal, which allows for
covariance structure among the components (9). This gives a more realistic
model of latent topic structure in which the presence of one latent topic may
be correlated with the presence of another. The CTM fits the data better than
LDA and provides a rich way of visualizing and exploring text collections.
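To make the logistic normal concrete, here is a minimal sketch of its generative step, with an invented covariance matrix that mimics the genetics/health/astronomy example above; it illustrates the distribution itself, not a CTM fitted to real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Topics: 0 = genetics, 1 = health/disease, 2 = x-ray astronomy.
# Hypothetical covariance: genetics and health co-occur; astronomy does not.
mu = np.zeros(3)
Sigma = np.array([[ 1.0,  0.8, -0.5],
                  [ 0.8,  1.0, -0.5],
                  [-0.5, -0.5,  1.0]])

# Logistic normal draw: a Gaussian in R^K, mapped by softmax onto the simplex.
eta = rng.multivariate_normal(mu, Sigma, size=100_000)
theta = np.exp(eta - eta.max(axis=1, keepdims=True))  # numerically stable
theta /= theta.sum(axis=1, keepdims=True)

# Unlike the Dirichlet, the first two components now correlate positively.
print(np.corrcoef(theta, rowvar=False).round(3))
```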
 