4.2.1 Statistical Assumptions
The idea behind LDA is to model documents as arising from multiple topics,
where a topic is defined to be a distribution over a fixed vocabulary of terms.
Specifically, we assume that K topics are associated with a collection, and
that each document exhibits these topics with different proportions. This is
often a natural assumption to make because documents in a corpus tend to
be heterogeneous, combining a subset of main ideas or themes that permeate
the collection as a whole.
JSTOR's archive of Science, for example, exhibits a variety of fields, but
each document might combine them in novel ways. One document might
be about genetics and neuroscience; another might be about genetics and
technology; a third might be about neuroscience and technology. A model
that limits each document to a single topic cannot capture the essence of
neuroscience in the same way as a model that allows each topic to be
expressed only in part in each document. The challenge is that these topics
are not known in advance; our goal is to learn them from the data.
More formally, LDA casts this intuition into a hidden variable model of
documents. Hidden variable models are structured distributions in which
observed data interact with hidden random variables. With a hidden
variable model, the practitioner posits a hidden structure in the observed
data, and then learns that structure using posterior probabilistic inference. Hidden
variable models are prevalent in machine learning; examples include hidden
Markov models (30), Kalman filters (22), phylogenetic tree models (24), and
mixture models (25).
In LDA, the observed data are the words of each document and the hidden
variables represent the latent topical structure, i.e., the topics themselves and
how each document exhibits them. Given a collection, the posterior
distribution of the hidden variables given the observed documents
determines a hidden topical decomposition of the collection. Applications
of topic modeling use posterior estimates of these hidden variables to
perform tasks such as
information retrieval and document browsing.
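As a concrete illustration (not part of the original text), such posterior estimates can be obtained with off-the-shelf software. The sketch below assumes scikit-learn's variational implementation of LDA and a toy three-document corpus; the corpus, the number of topics, and all parameter values are illustrative assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy corpus (an assumption for demonstration): two themes, with the
    # third document mixing them, as in the Science example above.
    corpus = [
        "gene dna genetic sequencing genome",
        "neuron brain cortex synapse signal",
        "gene brain genetic neuron signal",
    ]

    # Represent documents as word counts over a fixed vocabulary.
    X = CountVectorizer().fit_transform(corpus)

    # K = 2 topics; in this implementation, doc_topic_prior and
    # topic_word_prior play the roles of alpha and eta below.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    theta = lda.fit_transform(X)   # per-document topic proportions
    beta = lda.components_         # unnormalized topic-word weights

    print(theta)  # each row: how strongly a document exhibits each topic

The rows of theta are posterior estimates of the per-document topic proportions; normalizing each row of beta gives estimates of the topics' distributions over words.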
The interaction between the observed documents and the hidden topic
structure is manifest in the probabilistic generative process associated with
LDA, the imaginary random process that is assumed to have produced the
observed data. Let K be a specified number of topics, V the size of the
vocabulary, α a positive K-vector, and η a scalar. We let Dir_K(α) denote a
K-dimensional Dirichlet with vector parameter α and Dir_V(η) denote a
V-dimensional symmetric Dirichlet with scalar parameter η.
1. For each topic,
   (a) Draw a distribution over words β_k ~ Dir_V(η).
2. For each document,
   (a) Draw a vector of topic proportions θ_d ~ Dir_K(α).
   (b) For each word,
       i. Draw a topic assignment Z_{d,n} ~ Mult(θ_d), Z_{d,n} ∈ {1, ..., K}.
       ii. Draw a word W_{d,n} ~ Mult(β_{Z_{d,n}}), W_{d,n} ∈ {1, ..., V}.
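This generative process is easy to simulate directly. The sketch below is an illustration, not from the original text; the values of K, V, the number of documents, the document lengths, and the hyperparameters α and η are arbitrary assumptions. It draws topics, per-document proportions, topic assignments, and words with numpy:

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D, N = 3, 10, 5, 8    # topics, vocabulary size, documents, words per document
    alpha = np.full(K, 0.5)     # positive K-vector
    eta = 0.1                   # scalar for the symmetric Dirichlet

    # 1. For each topic, draw a distribution over words beta_k ~ Dir_V(eta).
    beta = rng.dirichlet(np.full(V, eta), size=K)   # shape (K, V)

    documents = []
    for d in range(D):
        # 2(a). Draw topic proportions theta_d ~ Dir_K(alpha).
        theta = rng.dirichlet(alpha)
        words = []
        for n in range(N):
            # 2(b)i. Draw a topic assignment z ~ Mult(theta_d).
            z = rng.choice(K, p=theta)
            # 2(b)ii. Draw a word w ~ Mult(beta_z).
            words.append(rng.choice(V, p=beta[z]))
        documents.append(words)

Reversing this process, that is, inferring the topics, proportions, and assignments from the observed words alone, is precisely the posterior inference problem described above.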