Mining Text Conversations - Methods for Mining and Summarizing Text Conversations

Databases Reference

In-Depth Information

estimate φ and θ (variational EM), but one can also directly estimate (by Gibbs sampling) the

posterior distribution over z = P(z i = j | w i ) ; namely, a topic assignment for words which specifies

how likely is a topic given a certain word.

By assuming the words in a sentence occur independently, a topic assignment for words allows

us to also compute a topic assignment for sentences as follows: the topic for a given sentence s should

be the one with the highest probability given all the words in s , formally:

j ∗ =

argmax j P(z i =

j

|

s),

where

P(z i =

j

|

s)

=

P(z i =

j

|

w 1 , ..., w n )

=

P(z i =

j

|

w i ).

w i ∈ s

As a final step, LDA enables topic segmentation. Once we have assigned to each sentence

its most likely topic, blocks of adjacent sentences sharing the same topic will constitute the topical

segments.

If the reader is familiar with graphical models (e.g., Bayesian Networks), an LDA model is a

probabilistic generative graphical model that formally describes the generation of all the documents

in a collection. Figure 3.3 shows the model in plate notation 2 for D documents, T topics and N d

words in each document d . The only variables that are observed in the graphical model (grayed in

the Figure) are the ones corresponding to the words in the documents, while all the other variables

are hidden and include z , a topic assignment for each word in each document, the distributions θ (d)

and φ (j) , as well as Dirichlet priors for those, α and η , respectively, (from which LDA gets its name).

Figure 3.3: Graphical model for LDA in plate notation. The only observed variables are the w nodes

corresponding to words. As the plate notation shows, the model contains a node for each word, in each

document of the collection.

2 Plate notation is a concise way to show complex graphical models, in which some subgraphs are replicated for each element in a

set. More specifically, each subgraph encircled in a box must be replicated for each element of the set denoted by the box label.

For example, in Figure 3.3 the subgraph contained in the biggest box must be replicated for each document in the set D .For

more details on the plate notation, see Poole and Mackworth [ 2010 ].

Methods for Mining and Summarizing Text Conversations

Search WWH ::

Custom Search

Home