introduction). The key intuition behind LDA is to model the generation of a document (or of
a collection of documents) as a stochastic process in which words are selected by sampling some
discrete probability distributions. More specifically, we assume that our documents are about a set
of topics, each document is a multinomial distribution over topics, and each topic is a multinomial
distribution over words. Figure 3.2 shows examples of these distributions for a toy LDA model involving two documents, three topics, and a vocabulary of one thousand words.
Sample Multinomial Distributions over Topics

              Topic T1   Topic T2   Topic T3
  Document 1    .1         .7         .2
  Document 2    .3         .3         .4

Sample Multinomial Distributions over Words

              Word w1    Word w2    ...   Word w1000
  Topic T1     .001       .021      ...    .00006
  Topic T2     .0021      .006      ...    .03
  Topic T3     .0065      .0043     ...    .009

Sample Dictionary

  Word w1: ability    Word w2: abrasion    ...    Word w1000: youth

Figure 3.2: Probability distributions for a sample LDA topic model of two documents, involving three topics and a dictionary of one thousand words.
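The numbers in Figure 3.2 can be written down directly as Python data. This is just a transcription of the figure, with a sanity check that each document-topic row is a valid multinomial; the variable names theta and phi_shown are illustrative, and only the three vocabulary words shown in the figure are listed (a real topic-word row would cover all 1000 words and sum to 1):

```python
# theta[d][j] = P(z = j) for document d (document-topic distribution),
# transcribed from Figure 3.2.
theta = {
    "Document 1": [0.1, 0.7, 0.2],
    "Document 2": [0.3, 0.3, 0.4],
}

# phi[j][w] = P(w | z = j) for topic j (topic-word distribution).
# Only the three words shown in the figure appear here, so these rows
# do NOT sum to 1; the full 1000-word rows would.
phi_shown = {
    "T1": {"ability": 0.001,  "abrasion": 0.021,  "youth": 0.00006},
    "T2": {"ability": 0.0021, "abrasion": 0.006,  "youth": 0.03},
    "T3": {"ability": 0.0065, "abrasion": 0.0043, "youth": 0.009},
}

# Each document-topic row is a proper multinomial: it sums to 1.
for d, row in theta.items():
    assert abs(sum(row) - 1.0) < 1e-9
```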
Assuming that the variable z ranges over the topics (T1, T2, T3 in our example), and the variable w ranges over the words (w1, ..., w1000 in the example), we can refer to the topic-word distribution as φ^(j) = P(w | z_i = j) (one for each topic) and to the document-topic distribution as θ^(d) = P(z) (one for each document). Now we can more formally specify the stochastic process that generates all the words of document d in a collection: it consists of repeated samples from θ^(d) = P(z) to get a topic, and from φ^(j) = P(w | z_i = j) to get a word given that topic.
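This two-step sampling process can be sketched in a few lines of pure Python. The function name generate_document, the four-word vocabulary, and the topic-word numbers below are illustrative inventions (only Document 1's topic weights come from Figure 3.2):

```python
import random

def generate_document(theta_d, phi, vocab, n_words, seed=0):
    """Sample n_words tokens: first draw a topic from the document's
    topic distribution theta_d, then draw a word from that topic's
    word distribution phi[topic]."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Step 1: sample a topic index j from theta^(d) = P(z).
        j = rng.choices(range(len(theta_d)), weights=theta_d)[0]
        # Step 2: sample a word from phi^(j) = P(w | z = j).
        w = rng.choices(vocab, weights=phi[j])[0]
        words.append(w)
    return words

# Illustrative 3-topic, 4-word model (made-up numbers, not from the text).
vocab = ["ability", "abrasion", "topic", "youth"]
phi = [
    [0.7, 0.1, 0.1, 0.1],  # topic 0 favors "ability"
    [0.1, 0.7, 0.1, 0.1],  # topic 1 favors "abrasion"
    [0.1, 0.1, 0.1, 0.7],  # topic 2 favors "youth"
]
theta_doc1 = [0.1, 0.7, 0.2]  # Document 1's topic weights from Figure 3.2
doc = generate_document(theta_doc1, phi, vocab, n_words=10)
```

With Document 1's weights, topic 1 is drawn about 70% of the time, so "abrasion" tends to dominate the sampled tokens.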
For the more statistically inclined, an LDA model specifies the following distribution over the words within a document, by combining the φ and θ distributions:

  P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j),

where T is the number of topics, P(w_i | z_i = j) is the probability of word w_i under topic j, and P(z_i = j) is the probability that the j-th topic was sampled for the i-th word token.
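The mixture sum is a one-line computation. As a worked example, the sketch below plugs in the Figure 3.2 numbers for Document 1 and the three vocabulary words shown there (the function name word_probability is illustrative):

```python
def word_probability(w_index, theta_d, phi):
    """P(w_i) = sum over topics j of P(w_i | z_i = j) * P(z_i = j)."""
    return sum(phi[j][w_index] * theta_d[j] for j in range(len(theta_d)))

# Topic-word probabilities for the three words shown in Figure 3.2
# (columns w_1, w_2, w_1000), one row per topic T1, T2, T3.
phi = [
    [0.001,  0.021,  0.00006],
    [0.0021, 0.006,  0.03],
    [0.0065, 0.0043, 0.009],
]
theta_doc1 = [0.1, 0.7, 0.2]  # Document 1's topic distribution

# Probability of w_2 ("abrasion") appearing in Document 1:
# 0.021*0.1 + 0.006*0.7 + 0.0043*0.2 = 0.00716
p = word_probability(1, theta_doc1, phi)
```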
The power of an LDA formalization is that such a model (i.e., all the probability distributions)
can be effectively learned for any given set of documents. Not only is there an efficient method to