When the end task is predictive, one can use cross validation on the error
of the task at hand (e.g., information retrieval, text classification). When
the goal is qualitative, such as corpus exploration, one can use cross
validation on predictive likelihood, essentially choosing the number of topics
that provides the best language model. An alternative is to take a
nonparametric Bayesian approach: hierarchical Dirichlet processes can be used
to develop a topic model in which the number of topics is automatically
selected and may grow as new data is observed (35).
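As a concrete illustration of the cross-validation approach, here is a minimal sketch in Python, assuming scikit-learn is available; the tiny corpus and the candidate topic counts are placeholders, not data from this chapter. It selects the number of topics by held-out perplexity, the quantity whose minimization corresponds to choosing the best language model.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

# Toy stand-in corpus; in practice these would be real documents.
docs = [
    "gene dna genetic heredity mutation",
    "dna sequencing genome gene expression",
    "disease health clinical patient treatment",
    "patient health disease therapy trial",
    "x-ray telescope astronomy stellar emission",
    "stellar x-ray source astronomy survey",
    "gene disease risk health genetics",
    "telescope survey emission galaxy x-ray",
]
X = CountVectorizer().fit_transform(docs)

best_k, best_perplexity = None, np.inf
for k in (2, 3, 5, 10):  # candidate numbers of topics
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=4, shuffle=True,
                                     random_state=0).split(docs):
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X[train_idx])
        # Held-out perplexity: lower means a better language model.
        fold_scores.append(lda.perplexity(X[test_idx]))
    mean_perplexity = float(np.mean(fold_scores))
    if mean_perplexity < best_perplexity:
        best_k, best_perplexity = k, mean_perplexity

print(f"selected number of topics: {best_k}")
```

When the goal is predictive rather than exploratory, the same loop can score a downstream task error (e.g., classification accuracy) in place of perplexity.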
4.4 Dynamic Topic Models and Correlated Topic Models
In this section, we will describe two extensions to LDA: the correlated topic
model and the dynamic topic model. Each embellishes LDA to relax one of
its implicit assumptions. In addition to describing topic models that are more
powerful than LDA, our goal is to give the reader a sense of the practice of
topic modeling. Deciding on an appropriate model of a corpus depends both on
what kind of structure is hidden in the data and on what kind of structure the
practitioner cares to examine. While LDA may be appropriate for learning a
fixed set of topics, other applications of topic modeling may call for discovering
the connections between topics or modeling topics as changing through time.
4.4.1 The Correlated Topic Model
One limitation of LDA is that it fails to directly model correlation between
the occurrence of topics. In many—indeed most—text corpora, it is natural
to expect that the occurrences of the underlying latent topics will be highly
correlated. In the Science corpus, for example, an article about genetics may
be likely to also be about health and disease, but unlikely to also be about
x-ray astronomy.
In LDA, this modeling limitation stems from the independence assump-
tions implicit in the Dirichlet distribution of the topic proportions. Specifi-
cally, under a Dirichlet, the components of the proportions vector are nearly
independent, which leads to the strong assumption that the presence of one
topic is not correlated with the presence of another. (We say “nearly
independent” because the constraint that the components sum to one induces a
slight negative correlation among them.)
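This near-independence can be checked empirically. The following minimal sketch (the dimension and concentration parameter are chosen arbitrarily) samples many proportion vectors from a Dirichlet with NumPy and prints their correlation matrix, whose off-diagonal entries are all slightly negative and never positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 topic-proportion vectors from a symmetric Dirichlet over 5 topics.
theta = rng.dirichlet(alpha=np.full(5, 0.5), size=100_000)

# Off-diagonal entries are slightly negative and never positive:
# the Dirichlet cannot express positive correlation between topics.
print(np.corrcoef(theta, rowvar=False).round(3))
```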
In the correlated topic model (CTM), we model the topic proportions with an
alternative, more flexible distribution, the logistic normal, which allows for
covariance structure among the components (9). This gives a more realistic
model of latent topic structure in which the presence of one latent topic may
be correlated with the presence of another. The CTM fits the data better than
LDA and provides a rich way of visualizing and exploring text collections.
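To make the logistic normal concrete, here is a minimal sketch of its generative step, with an invented covariance matrix that mimics the genetics/health/astronomy example above; it illustrates the distribution itself, not a CTM fitted to real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Topics: 0 = genetics, 1 = health/disease, 2 = x-ray astronomy.
# Hypothetical covariance: genetics and health co-occur; astronomy does not.
mu = np.zeros(3)
Sigma = np.array([[ 1.0,  0.8, -0.5],
                  [ 0.8,  1.0, -0.5],
                  [-0.5, -0.5,  1.0]])

# Logistic normal draw: a Gaussian in R^K, mapped by softmax onto the simplex.
eta = rng.multivariate_normal(mu, Sigma, size=100_000)
theta = np.exp(eta - eta.max(axis=1, keepdims=True))  # numerically stable
theta /= theta.sum(axis=1, keepdims=True)

# Unlike the Dirichlet, the first two components now correlate positively.
print(np.corrcoef(theta, rowvar=False).round(3))
```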
 