computer     chemistry    cortex       orbit         infection
methods      synthesis    stimulus     dust          immune
number       oxidation    fig          jupiter       aids
two          reaction     vision       line          infected
principle    product      neuron       system        viral
design       organic      recordings   solar         cells
access       conditions   visual       gas           vaccine
processing   cluster      stimuli      atmospheric   antibodies
advantage    molecule     recorded     mars          hiv
important    studies      motor        field         parasite
FIGURE 4.1: Five topics from a 50-topic LDA model fit to Science from
1980-2002.
With the statistical tools that we describe below, we can automatically
organize electronic archives to facilitate efficient browsing and exploring. As
a running example, we will analyze JSTOR's archive of the journal Science.
Figure 4.1 illustrates five “topics” (i.e., sets of highly probable words) that
were discovered automatically from this collection using the simplest topic
model, latent Dirichlet allocation (LDA) (10) (see Section 4.2). Further
embellishing LDA allows us to discover connected topics (Figure 4.7) and trends
within topics (Figure 4.9). We emphasize that these algorithms have no prior
notion of the existence of the illustrated themes, such as neuroscience or
genetics. The themes are automatically discovered by analyzing the original
texts.
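To make the notion of a topic as a set of highly probable words concrete, the following is a minimal sketch in Python. It assumes a fitted topic is available as a mapping from words to probabilities. The function name `top_words` and the probability values are hypothetical and chosen only for illustration; they are not taken from the model behind Figure 4.1.

```python
def top_words(topic, n=10):
    """Return the n most probable words of a topic.

    `topic` is assumed to be a dict mapping each vocabulary word to its
    probability under that topic (hypothetical representation).
    """
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Made-up probabilities for illustration only; a real topic assigns a
# probability to every word in the vocabulary.
example_topic = {
    "infection": 0.045,
    "immune": 0.040,
    "hiv": 0.030,
    "vaccine": 0.020,
    "the": 0.001,
    "orbit": 0.0005,
}

print(top_words(example_topic, n=3))
```

Listing a topic by its top-ranked words, as in Figure 4.1, is a display choice: the topic itself remains a full distribution over the vocabulary.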
This chapter is organized as follows. In Section 4.2, we discuss the LDA
model and illustrate how to use its posterior distribution as an exploratory tool
for large corpora. In Section 4.3, we describe how to effectively approximate
that posterior with mean field variational methods. In Section 4.4, we relax
two of the implicit assumptions that LDA makes to find maps of related
topics and model topics changing through time. Again, we illustrate how
these extensions facilitate understanding and exploring the latent structure of
modern corpora.
4.2 Latent Dirichlet Allocation
In this section we describe latent Dirichlet allocation (LDA), which has
served as a springboard for many other topic models. LDA is based on seminal
work in latent semantic indexing (LSI) (12) and probabilistic LSI (20). The
relationship between these techniques is clearly described in (33). Here, we
develop LDA from the principles of generative probabilistic models.
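As a preview of what a generative view means in practice, the following is a minimal sketch of the standard LDA generative process in Python, using only the standard library. It assumes symmetric Dirichlet priors and a fixed document length. The function names, the hyperparameter values `alpha` and `eta`, and the toy vocabulary are illustrative assumptions, not the chapter's notation or settings.

```python
import random


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index with probability given by `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, num_topics, num_docs, doc_length,
                    alpha=0.5, eta=0.5, seed=0):
    """Generate documents from the LDA generative process (sketch).

    alpha and eta are hypothetical symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Each topic is a distribution over the vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Each document has its own proportions over the topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_index(theta, rng)      # choose a topic for this word
            w = sample_index(topics[z], rng)  # choose a word from that topic
            words.append(vocab[w])
        docs.append(words)
    return topics, docs


# Toy vocabulary, for illustration only.
vocab = ["cortex", "neuron", "orbit", "mars", "hiv", "vaccine"]
topics, docs = generate_corpus(vocab, num_topics=2, num_docs=3, doc_length=8)
for doc in docs:
    print(" ".join(doc))
```

Fitting LDA to a real collection runs in the opposite direction: only the words are observed, and the topics and per-document proportions are inferred from them.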
 