4.3.2 Practical Considerations
Here, we discuss some of the practical considerations in implementing the
algorithm of Figure 4.5.
Precomputation. The computational bottleneck of the algorithm is computing
the Ψ (digamma) function, which should be precomputed as much as possible.
We typically store E[log β_{k,w}] and E[log θ_{d,k}], only recomputing them
when their underlying variational parameters change.
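Under the variational Dirichlet distributions, both expectations have a standard closed form in terms of the digamma function. A minimal caching sketch in Python (the names lam and gam for the underlying variational Dirichlet parameters are our notation, not the text's):

```python
import numpy as np
from scipy.special import psi  # the digamma function Psi

def expected_log_dirichlet(param):
    """Row-wise E[log X] for X ~ Dirichlet(param).

    For a Dirichlet with parameters (a_1, ..., a_V),
    E[log X_v] = Psi(a_v) - Psi(sum_j a_j). One call covers all
    topics (rows of a K x V array) or all documents (rows of D x K).
    """
    return psi(param) - psi(param.sum(axis=-1, keepdims=True))

# Cache these and refresh only when the underlying parameters change:
# Elog_beta  = expected_log_dirichlet(lam)   # E[log beta_{k,w}], K x V
# Elog_theta = expected_log_dirichlet(gam)   # E[log theta_{d,k}], D x K
```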
Nested computation. In practice, we infer the per-document parameters
until convergence for each document before updating the topic estimates.
This amounts to repeating steps 2(a) and 2(b) of the algorithm for each
document before updating the topics themselves in step 1. For each
per-document variational update, we initialize γ_{d,k} = 1/K.
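A structural sketch of this nesting (here initialize_topics, update_phi, update_gamma, and update_topics are placeholders for the actual update equations, and expected_log_dirichlet is the helper sketched above):

```python
import numpy as np

def variational_inference(docs, K, n_sweeps=100, inner_tol=1e-4):
    # Sketch of the control flow only; the helpers stand in for the
    # update equations and the topic re-estimation of step 1.
    lam = initialize_topics(docs, K)              # variational topic parameters
    for _ in range(n_sweeps):
        Elog_beta = expected_log_dirichlet(lam)   # cached for the whole sweep
        per_doc_stats = []
        for doc in docs:
            gamma = np.full(K, 1.0 / K)           # initialize gamma_{d,k} = 1/K
            converged = False
            while not converged:                  # steps 2(a)-(b), per document
                phi = update_phi(doc, gamma, Elog_beta)   # step 2(a)
                new_gamma = update_gamma(doc, phi)        # step 2(b)
                converged = np.abs(new_gamma - gamma).mean() < inner_tol
                gamma = new_gamma
            per_doc_stats.append((doc, phi))
        lam = update_topics(per_doc_stats)        # step 1: topic update
    return lam
```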
Repeated updates for φ. Note that Eq. (4.10) is identical for each
occurrence of the term w_n. Thus, we need not treat multiple instances of the
same word in the same document separately. The update for each instance of
the word is identical, and we need only compute it once for each unique term
in each document. The update in Eq. (4.9) can thus be written as

    γ_{d,k}^{(t+1)} = α_k + Σ_{v=1}^{V} n_{d,v} φ_{d,v,k}^{(t)}        (4.14)

where n_{d,v} is the number of occurrences of term v in document d.
This is a computational advantage of the mean field variational inference
algorithm over other approaches, allowing us to analyze very large document
collections.
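Putting the two previous points together, one way the per-document update might look, assuming φ takes the usual exponentiated form of Eq. (4.10) (a sketch: the restriction to nonzero counts and the log-space normalization are standard implementation choices, not prescribed by the text):

```python
import numpy as np
from scipy.special import psi

def per_document_step(counts, gamma, Elog_beta, alpha):
    """One pass of steps 2(a)-(b) for a single document.

    counts:    length-V term-count vector, n_{d,v}
    gamma:     length-K variational Dirichlet parameter for this document
    Elog_beta: K x V cached values of E[log beta_{k,w}]
    alpha:     length-K Dirichlet hyperparameter
    """
    ids = np.nonzero(counts)[0]                   # unique terms in the document
    Elog_theta = psi(gamma) - psi(gamma.sum())    # E[log theta_{d,k}]
    # phi is computed once per unique term; all tokens of term v share it
    log_phi = Elog_theta[:, None] + Elog_beta[:, ids]   # K x (num unique terms)
    log_phi -= log_phi.max(axis=0)                # stabilize before exponentiating
    phi = np.exp(log_phi)
    phi /= phi.sum(axis=0)                        # normalize over topics
    # Eq. (4.14): gamma_{d,k} = alpha_k + sum_v n_{d,v} phi_{d,v,k}
    new_gamma = alpha + phi @ counts[ids]
    return new_gamma, phi
```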
Initialization and restarts. Since this algorithm finds a local maximum
of the variational objective function, initializing the topics is important. We
find that an effective initialization technique is to randomly choose a small
number (e.g., 1-5) of “seed” documents, create a distribution over words by
smoothing their aggregated word counts over the whole vocabulary, and from
these counts compute a first value for E[log β_{k,w}]. The inference algorithm
may be restarted multiple times, with different seed sets, to find a good local
maximum.
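One plausible rendering of this seeding strategy (the seed count, smoothing constant, and function names are illustrative assumptions):

```python
import numpy as np
from scipy.special import psi

def seed_initialize(doc_term_counts, K, n_seeds=3, smooth=1.0, seed=None):
    """Initialize each topic from a few randomly chosen seed documents.

    doc_term_counts: D x V matrix of raw term counts.
    """
    rng = np.random.default_rng(seed)
    D, V = doc_term_counts.shape
    lam = np.empty((K, V))
    for k in range(K):
        seeds = rng.choice(D, size=n_seeds, replace=False)
        # aggregate the seed documents' counts, smoothed over the vocabulary
        lam[k] = doc_term_counts[seeds].sum(axis=0) + smooth
    # first value of E[log beta_{k,w}] from the smoothed counts
    Elog_beta = psi(lam) - psi(lam.sum(axis=1, keepdims=True))
    return lam, Elog_beta
```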
Choosing the vocabulary. It is often computationally expensive to use
the entire vocabulary. Choosing the top V words by TFIDF is an effective
way to prune the vocabulary. This naturally prunes out stop words and other
terms that provide little thematic content to the documents. In the Science
analysis above we chose the top 10,000 terms this way.
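A minimal sketch of this pruning, assuming a dense document-term count matrix and one common TF-IDF variant (the text does not pin down the exact weighting):

```python
import numpy as np

def top_v_by_tfidf(doc_term_counts, v_keep):
    """Return indices of the v_keep terms with the highest corpus TF-IDF."""
    D = doc_term_counts.shape[0]
    df = np.count_nonzero(doc_term_counts, axis=0)   # document frequency
    idf = np.log(D / np.maximum(df, 1))              # stop words score near zero
    score = doc_term_counts.sum(axis=0) * idf        # total tf times idf
    return np.sort(np.argsort(score)[::-1][:v_keep])
```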
Choosing the number of topics. Choosing the number of topics is a
persistent problem in topic modeling and other latent variable analysis. In
some cases, the number of topics is part of the problem formulation and
specified by an outside source. In other cases, a natural approach is to use
cross-validation.
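A hedged sketch of that selection loop, where fit_lda and heldout_score are hypothetical stand-ins for the inference algorithm above and some held-out measure such as per-word predictive log likelihood:

```python
def choose_num_topics(train_docs, heldout_docs, candidate_Ks):
    # fit_lda and heldout_score are hypothetical stand-ins for the
    # inference algorithm above and a held-out evaluation measure.
    best_K, best_score = None, float("-inf")
    for K in candidate_Ks:
        model = fit_lda(train_docs, K)
        score = heldout_score(model, heldout_docs)
        if score > best_score:
            best_K, best_score = K, score
    return best_K
```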
 