The key to the CTM is the logistic normal distribution (2). The logistic normal is a distribution on the simplex that allows for a general pattern of variability between the components. It achieves this by mapping a multivariate random variable from R^d to the d-simplex.
In particular, the logistic normal distribution takes a draw from a multivariate Gaussian, exponentiates it, and maps it to the simplex via normalization. The covariance of the Gaussian leads to correlations between components of the resulting simplicial random variable. The logistic normal was originally studied in the context of analyzing observed data such as the proportions of minerals in geological samples. In the CTM, it is used in a hierarchical model where it describes the hidden composition of topics associated with each document.
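To make this construction concrete, the following is a minimal sketch in Python (assuming NumPy) of drawing from a logistic normal and inspecting the induced correlations; the dimension, mean, and covariance below are illustrative placeholders, not values from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3
    mu = np.zeros(K)
    # A positive off-diagonal covariance entry makes the first two
    # components of the simplicial variable tend to rise and fall together.
    Sigma = np.array([[1.0, 0.8, 0.0],
                      [0.8, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

    eta = rng.multivariate_normal(mu, Sigma, size=10_000)  # draws in R^K
    theta = np.exp(eta)                                    # exponentiate
    theta /= theta.sum(axis=1, keepdims=True)              # normalize onto the simplex

    # Empirical correlations between simplex components; a Dirichlet
    # cannot express the positive correlation this tends to produce.
    print(np.corrcoef(theta.T))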
Let {μ, Σ} be a K-dimensional mean and covariance matrix, and let topics β_{1:K} be K multinomials over a fixed word vocabulary, as above. The CTM assumes that an N-word document arises from the following generative process:
1. Draw η | {μ, Σ} ∼ N(μ, Σ).
2. For n ∈ {1, ..., N}:
   (a) Draw topic assignment Z_n | η from Mult(f(η)).
   (b) Draw word W_n | {z_n, β_{1:K}} from Mult(β_{z_n}).
The function that maps the real vector η to the simplex is

   f(η_i) = exp{η_i} / Σ_j exp{η_j}.   (4.15)
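As a concrete illustration, here is a minimal Python sketch of this generative process (again assuming NumPy); the topic matrix beta is a hypothetical K × V array whose rows are per-topic word distributions, and all other names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(eta):
        # Eq. (4.15): map a real vector eta to the simplex.
        e = np.exp(eta - eta.max())        # subtract max for numerical stability
        return e / e.sum()

    def generate_document(mu, Sigma, beta, N):
        eta = rng.multivariate_normal(mu, Sigma)       # step 1: eta ~ N(mu, Sigma)
        theta = f(eta)                                 # topic proportions f(eta)
        words = []
        for _ in range(N):                             # step 2
            z = rng.choice(len(theta), p=theta)        # (a) Z_n ~ Mult(f(eta))
            w = rng.choice(beta.shape[1], p=beta[z])   # (b) W_n ~ Mult(beta_{z_n})
            words.append(w)
        return words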
Note that this process is identical to the generative process of LDA from
Section 4.2 except that the topic proportions are drawn from a logistic normal
rather than a Dirichlet. The model is shown as a directed graphical model in
Figure 4.6.
The CTM is more expressive than LDA because it relaxes the strong independence assumption imposed by the Dirichlet, which is not realistic when analyzing real document collections. Quantitative results illustrate that the CTM fits held-out data better than LDA (9). Moreover, the higher-order structure given by the covariance can be used as an exploratory tool for better understanding and navigating a large corpus. Figure 4.7 illustrates the topics and their connections found by analyzing the same Science corpus as for Figure 4.1. This gives a richer way of visualizing and browsing the latent semantic structure inherent in the corpus.
However, the added flexibility of the CTM comes at a computational cost. Mean field variational inference for the CTM is not as fast or straightforward as the algorithm in Figure 4.5. In particular, the update for the variational distribution of the topic proportions must be fit by gradient-based optimization. See (9) for details.