A topic is formally defined as a distribution over a fixed vocabulary of words
[29]. Different topics have different distributions over the same vocabulary.
A topic can be viewed as a cluster of words with related meanings, where each
word carries a weight within the topic. Note that a word from the vocabulary can
appear in multiple topics with different weights. Topic models do not necessarily
require prior knowledge about the texts; the topics can emerge solely from
analyzing the text itself.
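To make this concrete, the toy sketch below (with made-up words and weights, not taken from any real corpus) represents two topics over the same small vocabulary as weighted word lists; note that the word "network" appears in both topics with different weights.

# Toy illustration (made-up words and weights): each topic is a
# distribution over the same fixed vocabulary, and a word such as
# "network" can carry a different weight in each topic.
topic_neural = {"network": 0.30, "training": 0.25, "layer": 0.20,
                "model": 0.15, "policy": 0.10}
topic_policy = {"policy": 0.35, "government": 0.25, "report": 0.20,
                "network": 0.10, "model": 0.10}

# Weights within a topic sum to 1, like a probability distribution.
assert abs(sum(topic_neural.values()) - 1.0) < 1e-9
assert abs(sum(topic_policy.values()) - 1.0) < 1e-9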
The simplest topic model is latent Dirichlet allocation (LDA) [29], a
generative probabilistic model of a corpus proposed by David M. Blei, Andrew
Y. Ng, and Michael I. Jordan. In generative probabilistic modeling, data is
treated as the result of a generative process that includes hidden variables.
LDA assumes a fixed vocabulary of words and a predefined, constant number of
latent topics. Each latent topic is a distribution over the vocabulary drawn
from a Dirichlet prior [30], and each document is represented as a random
mixture of the latent topics.
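As an illustration of fitting such a model in practice, the following minimal sketch uses scikit-learn's LatentDirichletAllocation on a tiny made-up corpus; the documents and the choice of two topics are assumptions for demonstration only, not part of the original example.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus, purely for demonstration.
docs = [
    "the neural network model improves training",
    "the government policy report discusses funding",
    "training a neural model requires data",
    "the policy report covers government funding",
]

# Bag-of-words counts over a fixed vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The number of latent topics is fixed in advance (2 here).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic proportions

# Each row of components_ holds the word weights for one topic.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])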
Figure 9.4 illustrates the intuitions behind LDA. The left side of the figure shows
four topics built from a corpus, where each topic contains a list of the most
important words from the vocabulary. The four example topics are related to
problem, policy, neural, and report. For each document, a distribution over the
topics is chosen, as shown in the histogram on the right. Next, a topic assignment is
picked for each word in the document, and each word is then drawn from its assigned
topic (the colored discs). In reality, only the documents (as shown in the middle
of the figure) are available. The goal of LDA is to infer the underlying topics, topic
proportions, and topic assignments for every document.
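The generative story sketched in the figure can also be simulated directly. The sketch below assumes a toy vocabulary, topic matrix, and Dirichlet parameter (none of them from the original figure): it draws per-document topic proportions, picks a topic assignment for each word position, and then draws the word from the assigned topic.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 3 topics over a 6-word vocabulary.
vocab = ["problem", "policy", "neural", "report", "model", "data"]
topics = np.array([
    [0.40, 0.05, 0.05, 0.05, 0.25, 0.20],   # "problem"-like topic
    [0.05, 0.45, 0.05, 0.30, 0.05, 0.10],   # "policy"/"report"-like topic
    [0.05, 0.05, 0.40, 0.05, 0.20, 0.25],   # "neural"-like topic
])
alpha = np.array([0.5, 0.5, 0.5])           # Dirichlet parameter over topics

def generate_document(num_words=8):
    # 1) Choose the document's topic proportions from a Dirichlet.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(num_words):
        # 2) Pick a topic assignment for this word position.
        z = rng.choice(len(theta), p=theta)
        # 3) Draw the word from the chosen topic's word distribution.
        w = rng.choice(len(vocab), p=topics[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document()
print("topic proportions:", np.round(theta, 2))
print("document:", " ".join(words))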