introduction). The key intuition behind LDA is to model the generation of a document (or of a collection of documents) as a stochastic process in which words are selected by sampling from discrete probability distributions. More specifically, we assume that our documents are about a set of topics, that each document is a multinomial distribution over topics, and that each topic is a multinomial distribution over words. Figure 3.2 shows examples of these distributions for a toy LDA model involving two documents, three topics, and a vocabulary of one thousand words.
Sample Multinomial Distributions over Topics

              Topic T_1   Topic T_2   Topic T_3
Document 1       .1          .7          .2
Document 2       .3          .3          .4

Sample Multinomial Distributions over Words

              Word w_1   Word w_2   ...   Word w_1000
Topic T_1      .001       .021      ...     .00006
Topic T_2      .0021      .006      ...     .03
Topic T_3      .0065      .0043     ...     .009

Sample Dictionary

Word w_1: ability   Word w_2: abrasion   ...   Word w_1000: youth

Figure 3.2: Probability distributions for a sample LDA Topic Model of two documents, involving three topics and a dictionary of one thousand words.
Assuming that the variable z ranges over the topics (T_1, T_2, T_3 in our example) and the variable w ranges over the words (w_1, ..., w_1000 in the example), we can refer to the topic-word distributions as φ^(j) = P(w | z_i = j) (one for each topic j) and to the document-topic distributions as θ^(d) = P(z) (one for each document d). Now we can more formally specify the stochastic process that generates all the words of a document d in a collection: it consists of repeated samples from θ^(d) = P(z) to get a topic, and from φ^(j) = P(w | z_i = j) to get a word given that topic.
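As a rough illustration, here is a minimal Python sketch of this sampling process, hard-coding the toy distributions of Figure 3.2 (only three of the one thousand vocabulary words are included, so the topic-word rows are renormalized; the variable names theta, phi, and vocab are ours):

import numpy as np

rng = np.random.default_rng(0)

# Document-topic distribution theta^(d) for Document 1 (from Figure 3.2).
theta_doc1 = np.array([0.1, 0.7, 0.2])       # P(z = T_1), P(z = T_2), P(z = T_3)

# Toy topic-word distributions phi^(j): one row per topic, one column per word.
# Only three of the 1000 words are shown; each full row would sum to 1,
# so we renormalize the truncated rows to keep them valid distributions.
vocab = ["ability", "abrasion", "youth"]
phi = np.array([
    [0.001,  0.021,  0.00006],   # Topic T_1
    [0.0021, 0.006,  0.03],      # Topic T_2
    [0.0065, 0.0043, 0.009],     # Topic T_3
])
phi = phi / phi.sum(axis=1, keepdims=True)

def generate_document(theta, phi, n_words):
    """Generate n_words tokens: sample a topic from theta, then a word from phi[topic]."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)    # sample a topic for this token
        w = rng.choice(len(vocab), p=phi[z])   # sample a word given that topic
        words.append(vocab[w])
    return words

print(generate_document(theta_doc1, phi, 10))

Each run produces a bag of words whose topical mixture reflects θ^(d): for Document 1, roughly seventy percent of the tokens should be drawn from topic T_2.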
For the more statistically inclined, an LDA model specifies the following distribution over the words within a document, obtained by combining the φ and θ distributions:

P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j),

where T is the number of topics, P(w_i | z_i = j) is the probability of word w_i under topic j, and P(z_i = j) is the probability that the j-th topic was sampled for the i-th word token.
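As a quick check of what this formula computes, plug in the sample distributions of Figure 3.2: the probability of word w_1 occurring in Document 1 is P(w_1) = .1 × .001 + .7 × .0021 + .2 × .0065 = .00287, i.e., the topic-word probabilities weighted by how prominent each topic is in the document.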
The power of an LDA formalization is that such a model (i.e., all the probability distributions)
can be effectively learned for any given set of documents. Not only is there an efficient method to