estimate φ and θ (variational EM), but one can also directly estimate (by Gibbs sampling) the
posterior distribution over z, P(z_i = j | w_i); namely, a topic assignment for words, which specifies
how likely each topic is given a certain word.
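To make the Gibbs sampling route concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA. The text names the technique without giving the update rule, so the standard collapsed update used below is an assumption about what is intended, not the authors' implementation. The names `collapsed_gibbs` and `topic_given_word`, the hyperparameter defaults, and the encoding of words as integer ids are all hypothetical. The final estimate of P(z_i = j | w_i) is a crude one, read off a single sample of the topic assignments.

```python
import random


def collapsed_gibbs(docs, num_topics, vocab_size, alpha=0.5, eta=0.1,
                    iterations=100, seed=0):
    """Sample topic assignments z for every word token (illustrative sketch).

    docs is a list of documents, each a list of integer word ids.
    Returns the assignments z and the topic-by-word count table.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic j
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens of word w in topic j
    topic_total = [0] * num_topics                               # tokens in topic j overall

    # Start from a random assignment and fill the count tables.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            j = rng.randrange(num_topics)
            z[d].append(j)
            doc_topic[d][j] += 1
            topic_word[j][w] += 1
            topic_total[j] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                old = z[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Unnormalized conditional for each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return z, topic_word


def topic_given_word(topic_word, w):
    """Estimate P(z = j | w) as the share of occurrences of w assigned to topic j."""
    column = [row[w] for row in topic_word]
    total = sum(column)  # assumes word w occurs at least once in the corpus
    return [count / total for count in column]


# Toy corpus over a four-word vocabulary (word ids 0..3).
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
toy_z, toy_counts = collapsed_gibbs(toy_docs, num_topics=2, vocab_size=4)
print(topic_given_word(toy_counts, 0))
```

In practice one would average over several samples taken after a burn-in period rather than rely on the last one.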
By assuming the words in a sentence occur independently, a topic assignment for words allows
us to also compute a topic assignment for sentences as follows: the topic for a given sentence s should
be the one with the highest probability given all the words in s, formally:

\[
j^{*} = \arg\max_{j} P(z_i = j \mid s),
\]

where

\[
P(z_i = j \mid s) = P(z_i = j \mid w_1, \ldots, w_n) = \prod_{w_i \in s} P(z_i = j \mid w_i).
\]
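The sentence-level rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function name `sentence_topic`, the dictionary layout of the per-word posteriors, and the choice to skip words without a posterior are all hypothetical. The product over words is computed as a sum of logarithms, which gives the same argmax while avoiding numerical underflow on long sentences.

```python
import math


def sentence_topic(sentence, word_topic_probs):
    """Return (best_topic, log_scores) for a sentence given as a list of words.

    word_topic_probs maps each word to a list whose j-th entry is P(z = j | word).
    """
    num_topics = len(next(iter(word_topic_probs.values())))
    log_scores = [0.0] * num_topics
    for word in sentence:
        probs = word_topic_probs.get(word)
        if probs is None:
            continue  # assumption: words with no known posterior are ignored
        for j, p in enumerate(probs):
            # log(0) is treated as -inf so a zero factor rules the topic out
            log_scores[j] += math.log(p) if p > 0 else float("-inf")
    best = max(range(num_topics), key=log_scores.__getitem__)
    return best, log_scores


# Made-up posteriors over two topics, purely for illustration.
example_probs = {
    "the": [0.5, 0.5],
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "vote": [0.1, 0.9],
}
print(sentence_topic(["the", "goal", "match"], example_probs)[0])  # prints 0
print(sentence_topic(["the", "vote"], example_probs)[0])           # prints 1
```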
As a final step, LDA enables topic segmentation. Once we have assigned to each sentence
its most likely topic, blocks of adjacent sentences sharing the same topic will constitute the topical
segments.
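The segmentation step amounts to grouping runs of equal topic labels. A minimal sketch, assuming the per-sentence topics are already available as a list of integers in document order (the function name `topical_segments` and the tuple layout are hypothetical):

```python
from itertools import groupby


def topical_segments(sentence_topics):
    """Group adjacent sentences with the same topic.

    Returns a list of (topic, first_sentence_index, last_sentence_index) tuples.
    """
    segments = []
    start = 0
    for topic, run in groupby(sentence_topics):
        length = len(list(run))
        segments.append((topic, start, start + length - 1))
        start += length
    return segments


print(topical_segments([0, 0, 1, 1, 1, 0]))
# prints [(0, 0, 1), (1, 2, 4), (0, 5, 5)]
```

Note that a topic can reappear later in the document as a separate segment, since only adjacent sentences are merged.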
For readers familiar with graphical models (e.g., Bayesian networks), an LDA model is a
probabilistic generative graphical model that formally describes the generation of all the documents
in a collection. Figure 3.3 shows the model in plate notation (see footnote 2) for D documents, T topics, and N_d
words in each document d. The only variables that are observed in the graphical model (shaded gray in
the figure) are the ones corresponding to the words in the documents. All the other variables
are hidden: z, a topic assignment for each word in each document; the distributions θ^(d)
and φ^(j); and the Dirichlet priors for those, α and η, respectively (from which LDA gets its name).
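The generative story encoded by the graphical model can be sketched as a small simulation. This is an illustration under assumptions, not code from the source: it takes α as the symmetric Dirichlet prior on each document's topic distribution θ^(d) and η as the symmetric prior on each topic's word distribution φ^(j), and the function names, default values, and integer word ids are hypothetical. Dirichlet draws are obtained by normalizing independent gamma draws, using only the standard library.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(doc_lengths, num_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    """Generate documents as lists of (topic, word_id) pairs.

    doc_lengths[d] plays the role of N_d; its length plays the role of D.
    """
    rng = random.Random(seed)
    # One word distribution phi^(j) per topic, drawn from the eta prior.
    phi = [sample_dirichlet(rng, eta, vocab_size) for _ in range(num_topics)]
    corpus = []
    for n_d in doc_lengths:
        # One topic distribution theta^(d) per document, drawn from the alpha prior.
        theta = sample_dirichlet(rng, alpha, num_topics)
        doc = []
        for _ in range(n_d):
            z = rng.choices(range(num_topics), weights=theta)[0]   # hidden topic
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # observed word
            doc.append((z, w))
        corpus.append(doc)
    return phi, corpus


phi, corpus = generate_corpus(doc_lengths=[5, 8], num_topics=3, vocab_size=10)
for doc in corpus:
    print([w for _, w in doc])  # only the word ids would be observed
```

In a real corpus only the word ids are seen; inference runs this process in reverse to recover z, θ, and φ.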
Figure 3.3: Graphical model for LDA in plate notation. The only observed variables are the w nodes
corresponding to words. As the plate notation shows, the model contains a node for each word, in each
document of the collection.
2 Plate notation is a concise way to show complex graphical models, in which some subgraphs are replicated for each element in a
set. More specifically, each subgraph enclosed in a box must be replicated for each element of the set denoted by the box label.
For example, in Figure 3.3 the subgraph contained in the biggest box must be replicated for each document in the set D. For
more details on the plate notation, see Poole and Mackworth [2010].