Topic Models - Text Mining: Classification, Clustering, and Applications


Query: Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts (1994)

 1   Global Text Matching for Information Retrieval (1991)
 2   Automatic Text Analysis (1970)
 3   Language-Independent Categorization of Text (1995)
 4   Developments in Automatic Text Retrieval (1991)
 5   Simple and Rapid Method for the Coding of Punched Cards (1962)
 6   Data Processing by Optical Coincidence (1961)
 7   Pattern-Analyzing Memory (1976)
 8   The Storing of Pamphlets (1899)
 9   A Punched-Card Technique for Computing Means (1946)
10   Database Systems (1982)

FIGURE 4.10: The top ten most similar articles to the query in Science (1880-2002), scored by Eq. (4.4) using the posterior distribution from the dynamic topic model.

4.5 Discussion

We have described and discussed latent Dirichlet allocation and its application to decomposing and exploring a large collection of documents. We have also described two extensions: one allowing correlated occurrence of topics and one allowing topics to evolve through time. We have seen how topic modeling can provide a useful view of a large collection in terms of the collection as a whole, the individual documents, and the relationships between the documents.

The generative probabilistic approach to topic modeling has several advantages over non-probabilistic methods such as LSI (12) or non-negative matrix factorization (23). First, generative models are easily applied to new data. This is essential for tasks like information retrieval or classification. Second, generative models are modular; they can easily be used as a component in more complicated topic models. For example, LDA has been used in models of authorship (42), syntax (19), and meeting discourse (29). Finally, generative models are general in the sense that the observation emission probabilities need not be discrete. Instead of words, LDA-like models have been used to analyze images (15; 32; 6; 4), population genetics data (28), survey data (13), and social network data (1).
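To make the first advantage concrete, here is a minimal sketch of applying already-fitted topics to a document that was not in the training collection. It is not the variational inference described for LDA; it is a simplified EM-style "fold-in" that holds the topics fixed and estimates only the new document's topic proportions. The function name `infer_topic_proportions`, the smoothing value `alpha`, and the two toy topics are assumptions made for illustration.

```python
from collections import Counter


def infer_topic_proportions(doc_tokens, topics, alpha=0.1, n_iter=50):
    """Estimate topic proportions for one unseen document with topics held fixed.

    topics: list of dicts mapping word -> probability, one dict per topic.
    alpha:  smoothing pseudo-count added to every topic (an assumed value).
    Returns a list of proportions, one per topic, that sums to 1.
    """
    # Ignore words that no topic assigns any probability to.
    counts = Counter(w for w in doc_tokens if any(w in t for t in topics))
    k = len(topics)
    theta = [1.0 / k] * k  # start from uniform proportions
    for _ in range(n_iter):
        expected = [alpha] * k
        for word, n in counts.items():
            # E-step: how strongly each topic explains this word.
            weights = [theta[j] * topics[j].get(word, 1e-12) for j in range(k)]
            total = sum(weights)
            for j in range(k):
                expected[j] += n * weights[j] / total
        # M-step: renormalize the expected counts into proportions.
        norm = sum(expected)
        theta = [e / norm for e in expected]
    return theta


# Hypothetical topics, standing in for the output of a fitted model.
toy_topics = [
    {"gene": 0.5, "dna": 0.4, "data": 0.1},
    {"data": 0.5, "computer": 0.4, "gene": 0.1},
]

new_doc = ["gene", "dna", "dna", "gene", "data"]
proportions = infer_topic_proportions(new_doc, toy_topics)
```

Because the topics are never updated, the same fitted model can score any number of new documents, which is what retrieval and classification require.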

We conclude with a word of caution. The topics and topical decomposition found with LDA and other topic models are not “definitive.” Fitting a topic model to a collection will yield patterns within the corpus whether or not they are “naturally” there. (And starting the procedure from a different place will yield different patterns!)
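The dependence on the starting point can be explored with a small experiment. The sketch below is an assumption-laden stand-in rather than LDA itself: it fits a tiny PLSA-style topic mixture by EM, with the random initialization controlled by a `seed` argument. The names `fit_topics` and `top_words` and the toy corpus are hypothetical. Running it with different seeds may return different topics, or the same topics in a different order; running it twice with the same seed returns the same result.

```python
import random


def fit_topics(docs, vocab, k=2, seed=0, n_iter=100):
    """Fit k topics to tokenized docs by EM, starting from a seeded random point."""
    rng = random.Random(seed)
    index = {w: i for i, w in enumerate(vocab)}
    # Random starting point: each topic is a random distribution over the vocabulary.
    topics = []
    for _ in range(k):
        raw = [rng.random() + 0.01 for _ in vocab]
        total = sum(raw)
        topics.append([r / total for r in raw])
    thetas = [[1.0 / k] * k for _ in docs]
    for _ in range(n_iter):
        new_topics = [[1e-9] * len(vocab) for _ in range(k)]
        for d, doc in enumerate(docs):
            new_theta = [1e-9] * k
            for word in doc:
                i = index[word]
                # E-step: responsibility of each topic for this word occurrence.
                weights = [thetas[d][j] * topics[j][i] for j in range(k)]
                total = sum(weights)
                for j in range(k):
                    r = weights[j] / total
                    new_theta[j] += r
                    new_topics[j][i] += r
            # M-step for this document's topic proportions.
            s = sum(new_theta)
            thetas[d] = [t / s for t in new_theta]
        # M-step for the topics, after all documents have been processed.
        topics = [[v / sum(row) for v in row] for row in new_topics]
    return topics


def top_words(topic, vocab, n=2):
    """Return the n highest-probability words of one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


toy_vocab = ["gene", "dna", "cell", "data", "computer", "query"]
toy_docs = [
    ["gene", "dna", "cell", "gene"],
    ["data", "computer", "query", "data"],
    ["gene", "data", "cell", "computer"],
]

for seed in (1, 2, 3):
    fitted = fit_topics(toy_docs, toy_vocab, seed=seed)
    print(seed, [top_words(t, toy_vocab) for t in fitted])
```

Comparing the printed top words across seeds is a quick, informal check of how stable a fitted set of topics is before reading meaning into it.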
