A topic is formally defined as a distribution over a fixed vocabulary of words
[29]. Different topics have different distributions over the same vocabulary.
A topic can be viewed as a cluster of words with related meanings, where each
word carries a weight within the topic. Note that a word from the vocabulary can
appear in multiple topics with different weights. Topic models do not necessarily
require prior knowledge about the texts; the topics can emerge solely from
analyzing the text itself.
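To make this concrete, the toy sketch below (with made-up words and weights, not taken from any real corpus) represents two topics over the same small vocabulary as weighted word lists; note that the word "network" appears in both topics with different weights.

# Toy illustration (made-up words and weights): each topic is a
# distribution over the same fixed vocabulary, and a word such as
# "network" can carry a different weight in each topic.
topic_neural = {"network": 0.30, "training": 0.25, "layer": 0.20,
                "model": 0.15, "policy": 0.10}
topic_policy = {"policy": 0.35, "government": 0.25, "report": 0.20,
                "network": 0.10, "model": 0.10}

# Weights within a topic sum to 1, like a probability distribution.
assert abs(sum(topic_neural.values()) - 1.0) < 1e-9
assert abs(sum(topic_policy.values()) - 1.0) < 1e-9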
The simplest topic model is latent Dirichlet allocation (LDA) [29], a
generative probabilistic model of a corpus proposed by David M. Blei, Andrew
Y. Ng, and Michael I. Jordan. In generative probabilistic modeling, data is
treated as the result of a generative process that includes hidden variables.
LDA assumes a fixed vocabulary of words and a predefined, constant number of
latent topics. Each latent topic is a distribution over the vocabulary drawn
from a Dirichlet prior [30], and each document is represented as a random
mixture of the latent topics.
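As an illustration of fitting such a model in practice, the following minimal sketch uses scikit-learn's LatentDirichletAllocation on a tiny made-up corpus; the documents and the choice of two topics are assumptions for demonstration only, not part of the original example.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus, purely for demonstration.
docs = [
    "the neural network model improves training",
    "the government policy report discusses funding",
    "training a neural model requires data",
    "the policy report covers government funding",
]

# Bag-of-words counts over a fixed vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The number of latent topics is fixed in advance (2 here).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic proportions

# Each row of components_ holds the word weights for one topic.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])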
Figure 9.4 illustrates the intuitions behind LDA. The left side of the figure shows
four topics built from a corpus, where each topic contains a list of the most
important words from the vocabulary. The four example topics are related to
problem, policy, neural, and report. For each document, a distribution over the
topics is chosen, as shown in the histogram on the right. Next, a topic assignment is
picked for each word in the document, and each word is then drawn from its assigned
topic (the colored discs). In reality, only the documents (as shown in the middle
of the figure) are available. The goal of LDA is to infer the underlying topics, topic
proportions, and topic assignments for every document.
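The generative story sketched in the figure can also be simulated directly. The sketch below assumes a toy vocabulary, topic matrix, and Dirichlet parameter (none of them from the original figure): it draws per-document topic proportions, picks a topic assignment for each word position, and then draws the word from the assigned topic.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 3 topics over a 6-word vocabulary.
vocab = ["problem", "policy", "neural", "report", "model", "data"]
topics = np.array([
    [0.40, 0.05, 0.05, 0.05, 0.25, 0.20],   # "problem"-like topic
    [0.05, 0.45, 0.05, 0.30, 0.05, 0.10],   # "policy"/"report"-like topic
    [0.05, 0.05, 0.40, 0.05, 0.20, 0.25],   # "neural"-like topic
])
alpha = np.array([0.5, 0.5, 0.5])           # Dirichlet parameter over topics

def generate_document(num_words=8):
    # 1) Choose the document's topic proportions from a Dirichlet.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(num_words):
        # 2) Pick a topic assignment for this word position.
        z = rng.choice(len(theta), p=theta)
        # 3) Draw the word from the chosen topic's word distribution.
        w = rng.choice(len(vocab), p=topics[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document()
print("topic proportions:", np.round(theta, 2))
print("document:", " ".join(words))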