The key to the CTM is the logistic normal distribution (2). The logistic normal is a distribution on the simplex that allows for a general pattern of variability between the components. It achieves this by mapping a multivariate random variable from R^d to the d-simplex.
In particular, the logistic normal distribution takes a draw from a multivariate Gaussian, exponentiates it, and maps it to the simplex via normalization. The covariance of the Gaussian leads to correlations between components of the resulting simplicial random variable. The logistic normal was originally studied in the context of analyzing observed data such as the proportions of minerals in geological samples. In the CTM, it is used in a hierarchical model where it describes the hidden composition of topics associated with each document.
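To make this construction concrete, the following is a minimal sketch in Python (assuming NumPy) of drawing from a logistic normal and inspecting the induced correlations; the dimension, mean, and covariance below are illustrative placeholders, not values from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3
    mu = np.zeros(K)
    # A positive off-diagonal covariance entry makes the first two
    # components of the simplicial variable tend to rise and fall together.
    Sigma = np.array([[1.0, 0.8, 0.0],
                      [0.8, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

    eta = rng.multivariate_normal(mu, Sigma, size=10_000)  # draws in R^K
    theta = np.exp(eta)                                    # exponentiate
    theta /= theta.sum(axis=1, keepdims=True)              # normalize onto the simplex

    # Empirical correlations between simplex components; a Dirichlet
    # cannot express the positive correlation this tends to produce.
    print(np.corrcoef(theta.T))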
Let {μ, Σ} be a K-dimensional mean and covariance matrix, and let topics β_{1:K} be K multinomials over a fixed word vocabulary, as above. The CTM assumes that an N-word document arises from the following generative process:
1. Draw η | {μ, Σ} ∼ N(μ, Σ).
2. For n ∈ {1, ..., N}:
   (a) Draw topic assignment Z_n | η from Mult(f(η)).
   (b) Draw word W_n | {z_n, β_{1:K}} from Mult(β_{z_n}).
The function that maps the real vector η to the simplex is

   f(η_i) = exp{η_i} / Σ_j exp{η_j}.   (4.15)
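As a concrete illustration, here is a minimal Python sketch of this generative process (again assuming NumPy); the topic matrix beta is a hypothetical K × V array whose rows are per-topic word distributions, and all other names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(eta):
        # Eq. (4.15): map a real vector eta to the simplex.
        e = np.exp(eta - eta.max())        # subtract max for numerical stability
        return e / e.sum()

    def generate_document(mu, Sigma, beta, N):
        eta = rng.multivariate_normal(mu, Sigma)       # step 1: eta ~ N(mu, Sigma)
        theta = f(eta)                                 # topic proportions f(eta)
        words = []
        for _ in range(N):                             # step 2
            z = rng.choice(len(theta), p=theta)        # (a) Z_n ~ Mult(f(eta))
            w = rng.choice(beta.shape[1], p=beta[z])   # (b) W_n ~ Mult(beta_{z_n})
            words.append(w)
        return words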
Note that this process is identical to the generative process of LDA from
Section 4.2 except that the topic proportions are drawn from a logistic normal
rather than a Dirichlet. The model is shown as a directed graphical model in
Figure 4.6.
The CTM is more expressive than LDA because it relaxes the strong independence assumption imposed by the Dirichlet, which is not realistic when analyzing real document collections. Quantitative results illustrate that the CTM fits held-out data better than LDA (9). Moreover, the higher-order structure given by the covariance can be used as an exploratory tool for better understanding and navigating a large corpus. Figure 4.7 illustrates the topics and their connections found by analyzing the same Science corpus as for Figure 4.1. This gives a richer way of visualizing and browsing the latent semantic structure inherent in the corpus.
However, the added flexibility of the CTM comes at a computational cost. Mean field variational inference for the CTM is not as fast or straightforward as the algorithm in Figure 4.5. In particular, the update for the variational distribution of the topic proportions must be fit by gradient-based optimization. See (9) for details.