[Figure 4.2: directed graphical model with nodes α, θ_d, Z_{d,n}, W_{d,n},
β_k, η and plates N, D, K]
FIGURE 4.2: A graphical model representation of the latent Dirichlet
allocation (LDA). Nodes denote random variables; edges denote dependence
between random variables. Shaded nodes denote observed random variables;
unshaded nodes denote hidden random variables. The rectangular boxes are
“plate notation,” which denote replication.
i. Draw a topic assignment $Z_{d,n} \sim \mathrm{Mult}(\theta_d)$, $Z_{d,n} \in \{1, \ldots, K\}$.
ii. Draw a word $W_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$, $W_{d,n} \in \{1, \ldots, V\}$.
This is illustrated as a directed graphical model in Figure 4.2.
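
The two per-word draws can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the toy sizes `K`, `V`, `N`, and the hyperparameter values are assumptions chosen for the example, and indices are 0-based where the text uses 1, ..., K and 1, ..., V. Following Figure 4.2, the sketch assumes that θ_d is drawn from a Dirichlet with parameter α and each topic β_k from a Dirichlet with parameter η.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(theta_d, topics, n_words, rng):
    """Return (z, w): per-word topic assignments and word indices, both 0-based."""
    num_topics = len(theta_d)
    vocab_size = len(topics[0])
    z, w = [], []
    for _ in range(n_words):
        # Step i: draw a topic assignment Z_{d,n} ~ Mult(theta_d).
        z_dn = rng.choices(range(num_topics), weights=theta_d)[0]
        # Step ii: draw a word W_{d,n} ~ Mult(beta_{z_{d,n}}).
        w_dn = rng.choices(range(vocab_size), weights=topics[z_dn])[0]
        z.append(z_dn)
        w.append(w_dn)
    return z, w


# Toy sizes and hyperparameters (assumed values, for illustration only).
rng = random.Random(0)
K, V, N = 3, 8, 20
alpha = [0.5] * K   # symmetric Dirichlet parameter for topic proportions
eta = [0.5] * V     # symmetric Dirichlet parameter for topics

topics = [sample_dirichlet(eta, rng) for _ in range(K)]   # beta_1, ..., beta_K
theta_d = sample_dirichlet(alpha, rng)                    # proportions for one document
z, w = generate_document(theta_d, topics, N, rng)
```

Because a separate topic assignment is drawn for every word, a single document can mix words from several topics, in proportions governed by `theta_d`.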
The hidden topical structure of a collection is represented in the hidden
random variables: the topics $\beta_{1:K}$, the per-document topic proportions
$\theta_{1:D}$, and the per-word topic assignments $z_{1:D,1:N}$. With these
variables, LDA is a type of mixed-membership model (14). Mixed-membership
models differ from classical mixture models (25; 27), in which each document
is limited to exhibiting
one topic. This additional structure is important because, as we have noted,
documents often exhibit multiple topics; LDA can model this heterogeneity
while classical mixtures cannot. Advantages of LDA over classical mixtures
have been quantified by measuring document generalization (10).
LDA makes central use of the Dirichlet distribution, the exponential family
distribution over the simplex of positive vectors that sum to one. The Dirichlet
has density
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1}. \tag{4.1}$$
The parameter α is a positive K -vector, and Γ denotes the Gamma function,
which can be thought of as a real-valued extension of the factorial function.
A symmetric Dirichlet is a Dirichlet where each component of the parameter
is equal to the same value. The Dirichlet is used as a distribution over dis-
crete distributions; each component in the random vector is the probability
of drawing the item associated with that component.
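
As a quick check on Equation 4.1, the density can be evaluated numerically. The sketch below is illustrative only: the function name `dirichlet_log_density` is an assumption, and it works in log space with `math.lgamma` to avoid overflow in the Gamma terms.

```python
import math


def dirichlet_log_density(theta, alpha):
    """Log of Equation 4.1 for a point theta on the simplex and a positive parameter alpha."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    # log Gamma(sum_i alpha_i) - sum_i log Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # sum_i (alpha_i - 1) * log theta_i
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return log_norm + log_kernel


center = [1.0 / 3.0] * 3

# Symmetric Dirichlet with every component equal to 1: the density is flat,
# equal to Gamma(3) = 2 everywhere on the simplex.
flat = math.exp(dirichlet_log_density(center, [1.0, 1.0, 1.0]))

# Symmetric Dirichlet with every component equal to 2: at the center the
# density is Gamma(6) / Gamma(2)^3 * (1/3)^3 = 120 / 27.
peaked = math.exp(dirichlet_log_density(center, [2.0, 2.0, 2.0]))
```

With all components of α equal to 1, the density is constant over the simplex; larger symmetric values concentrate mass near the center, which is why `peaked` exceeds `flat` at the uniform point.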
LDA contains two Dirichlet random variables: the topic proportions θ are
distributions over topic indices $\{1, \ldots, K\}$; the topics β are
distributions over the vocabulary. In Section 4.4.2 and Section 4.4.1, we will
examine some of the properties of the Dirichlet, and replace these modeling
choices with an alternative distribution over the simplex.