introduction). The key intuition behind LDA is to model the generation of a document (or of
a collection of documents) as a stochastic process in which words are selected by sampling some
discrete probability distributions. More specifically, we assume that our documents are about a set
of topics, each document is a multinomial distribution over topics, and each topic is a multinomial
distribution over words. Figure 3.2 shows examples of these distributions for a toy LDA model involving two documents, three topics, and a vocabulary of one thousand words.
Sample Multinomial Distributions over Topics

              Topic T1   Topic T2   Topic T3
  Document 1    .1         .7         .2
  Document 2    .3         .3         .4

Sample Multinomial Distributions over Words

              Word w1    Word w2    ...   Word w1000
  Topic T1     .001       .021      ...    .00006
  Topic T2     .0021      .006      ...    .03
  Topic T3     .0065      .0043     ...    .009

Sample Dictionary

  Word w1: ability    Word w2: abrasion    ...    Word w1000: youth

Figure 3.2: Probability distributions for a sample LDA topic model of two documents, involving three topics and a dictionary of one thousand words.
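The numbers in Figure 3.2 can be written down directly as Python data. This is just a transcription of the figure, with a sanity check that each document-topic row is a valid multinomial; the variable names theta and phi_shown are illustrative, and only the three vocabulary words shown in the figure are listed (a real topic-word row would cover all 1000 words and sum to 1):

```python
# theta[d][j] = P(z = j) for document d (document-topic distribution),
# transcribed from Figure 3.2.
theta = {
    "Document 1": [0.1, 0.7, 0.2],
    "Document 2": [0.3, 0.3, 0.4],
}

# phi[j][w] = P(w | z = j) for topic j (topic-word distribution).
# Only the three words shown in the figure appear here, so these rows
# do NOT sum to 1; the full 1000-word rows would.
phi_shown = {
    "T1": {"ability": 0.001,  "abrasion": 0.021,  "youth": 0.00006},
    "T2": {"ability": 0.0021, "abrasion": 0.006,  "youth": 0.03},
    "T3": {"ability": 0.0065, "abrasion": 0.0043, "youth": 0.009},
}

# Each document-topic row is a proper multinomial: it sums to 1.
for d, row in theta.items():
    assert abs(sum(row) - 1.0) < 1e-9
```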
Assuming that the variable z ranges over the topics (T1, T2, T3 in our example), and the variable w ranges over the words (w1, ..., w1000 in the example), we can refer to the topic-word distribution as φ^(j) = P(w | z_i = j) (one for each topic) and to the document-topic distribution as θ^(d) = P(z) (one for each document). Now we can more formally specify the stochastic process that generates all the words of document d in a collection: it consists of repeated samples from θ^(d) = P(z) to get a topic, and from φ^(j) = P(w | z_i = j) to get a word given that topic.
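This two-step sampling process can be sketched in a few lines of pure Python. The function name generate_document, the four-word vocabulary, and the topic-word numbers below are illustrative inventions (only Document 1's topic weights come from Figure 3.2):

```python
import random

def generate_document(theta_d, phi, vocab, n_words, seed=0):
    """Sample n_words tokens: first draw a topic from the document's
    topic distribution theta_d, then draw a word from that topic's
    word distribution phi[topic]."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Step 1: sample a topic index j from theta^(d) = P(z).
        j = rng.choices(range(len(theta_d)), weights=theta_d)[0]
        # Step 2: sample a word from phi^(j) = P(w | z = j).
        w = rng.choices(vocab, weights=phi[j])[0]
        words.append(w)
    return words

# Illustrative 3-topic, 4-word model (made-up numbers, not from the text).
vocab = ["ability", "abrasion", "topic", "youth"]
phi = [
    [0.7, 0.1, 0.1, 0.1],  # topic 0 favors "ability"
    [0.1, 0.7, 0.1, 0.1],  # topic 1 favors "abrasion"
    [0.1, 0.1, 0.1, 0.7],  # topic 2 favors "youth"
]
theta_doc1 = [0.1, 0.7, 0.2]  # Document 1's topic weights from Figure 3.2
doc = generate_document(theta_doc1, phi, vocab, n_words=10)
```

With Document 1's weights, topic 1 is drawn about 70% of the time, so "abrasion" tends to dominate the sampled tokens.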
For the more statistically inclined, an LDA model specifies the following distribution over the words within a document, by combining the φ and θ distributions:

  P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j),

where T is the number of topics, P(w_i | z_i = j) is the probability of word w_i under topic j, and P(z_i = j) is the probability that the j-th topic was sampled for the i-th word token.
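The mixture sum is a one-line computation. As a worked example, the sketch below plugs in the Figure 3.2 numbers for Document 1 and the three vocabulary words shown there (the function name word_probability is illustrative):

```python
def word_probability(w_index, theta_d, phi):
    """P(w_i) = sum over topics j of P(w_i | z_i = j) * P(z_i = j)."""
    return sum(phi[j][w_index] * theta_d[j] for j in range(len(theta_d)))

# Topic-word probabilities for the three words shown in Figure 3.2
# (columns w_1, w_2, w_1000), one row per topic T1, T2, T3.
phi = [
    [0.001,  0.021,  0.00006],
    [0.0021, 0.006,  0.03],
    [0.0065, 0.0043, 0.009],
]
theta_doc1 = [0.1, 0.7, 0.2]  # Document 1's topic distribution

# Probability of w_2 ("abrasion") appearing in Document 1:
# 0.021*0.1 + 0.006*0.7 + 0.0043*0.2 = 0.00716
p = word_probability(1, theta_doc1, phi)
```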
The power of an LDA formalization is that such a model (i.e., all the probability distributions)
can be effectively learned for any given set of documents. Not only is there an efficient method to