4.3.2 Practical Considerations
Here, we discuss some of the practical considerations in implementing the
algorithm of Figure 4.5.
Precomputation. The computational bottleneck of the algorithm is computing
the Ψ (digamma) function, which should be precomputed as much as possible.
We typically store E[log β_{k,w}] and E[log θ_{d,k}], only recomputing them
when their underlying variational parameters change.
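Under the variational Dirichlet distributions, both expectations have a standard closed form in terms of the digamma function. A minimal caching sketch in Python (the names lam and gam for the underlying variational Dirichlet parameters are our notation, not the text's):

```python
import numpy as np
from scipy.special import psi  # the digamma function Psi

def expected_log_dirichlet(param):
    """Row-wise E[log X] for X ~ Dirichlet(param).

    For a Dirichlet with parameters (a_1, ..., a_V),
    E[log X_v] = Psi(a_v) - Psi(sum_j a_j). One call covers all
    topics (rows of a K x V array) or all documents (rows of D x K).
    """
    return psi(param) - psi(param.sum(axis=-1, keepdims=True))

# Cache these and refresh only when the underlying parameters change:
# Elog_beta  = expected_log_dirichlet(lam)   # E[log beta_{k,w}], K x V
# Elog_theta = expected_log_dirichlet(gam)   # E[log theta_{d,k}], D x K
```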
Nested computation. In practice, we infer the per-document parameters
until convergence for each document before updating the topic estimates.
This amounts to repeating steps 2(a) and 2(b) of the algorithm for each
document before updating the topics themselves in step 1. For each
per-document variational update, we initialize γ_{d,k} = 1/K.
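A structural sketch of this nesting (here initialize_topics, update_phi, update_gamma, and update_topics are placeholders for the actual update equations, and expected_log_dirichlet is the helper sketched above):

```python
import numpy as np

def variational_inference(docs, K, n_sweeps=100, inner_tol=1e-4):
    # Sketch of the control flow only; the helpers stand in for the
    # update equations and the topic re-estimation of step 1.
    lam = initialize_topics(docs, K)              # variational topic parameters
    for _ in range(n_sweeps):
        Elog_beta = expected_log_dirichlet(lam)   # cached for the whole sweep
        per_doc_stats = []
        for doc in docs:
            gamma = np.full(K, 1.0 / K)           # initialize gamma_{d,k} = 1/K
            converged = False
            while not converged:                  # steps 2(a)-(b), per document
                phi = update_phi(doc, gamma, Elog_beta)   # step 2(a)
                new_gamma = update_gamma(doc, phi)        # step 2(b)
                converged = np.abs(new_gamma - gamma).mean() < inner_tol
                gamma = new_gamma
            per_doc_stats.append((doc, phi))
        lam = update_topics(per_doc_stats)        # step 1: topic update
    return lam
```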
Repeated updates for φ. Note that Eq. (4.10) is identical for each
occurrence of the term w_n. Thus, we need not treat multiple instances of the
same word in the same document separately. The update for each instance of
the word is identical, and we need only compute it once for each unique term
in each document. The update in Eq. (4.9) can thus be written as

    γ_{d,k}^{(t+1)} = α_k + Σ_{v=1}^{V} n_{d,v} φ_{d,v,k}^{(t)}        (4.14)

where n_{d,v} is the number of occurrences of term v in document d.
This is a computational advantage of the mean field variational inference
algorithm over other approaches, allowing us to analyze very large document
collections.
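Putting the two previous points together, one way the per-document update might look, assuming φ takes the usual exponentiated form of Eq. (4.10) (a sketch: the restriction to nonzero counts and the log-space normalization are standard implementation choices, not prescribed by the text):

```python
import numpy as np
from scipy.special import psi

def per_document_step(counts, gamma, Elog_beta, alpha):
    """One pass of steps 2(a)-(b) for a single document.

    counts:    length-V term-count vector, n_{d,v}
    gamma:     length-K variational Dirichlet parameter for this document
    Elog_beta: K x V cached values of E[log beta_{k,w}]
    alpha:     length-K Dirichlet hyperparameter
    """
    ids = np.nonzero(counts)[0]                   # unique terms in the document
    Elog_theta = psi(gamma) - psi(gamma.sum())    # E[log theta_{d,k}]
    # phi is computed once per unique term; all tokens of term v share it
    log_phi = Elog_theta[:, None] + Elog_beta[:, ids]   # K x (num unique terms)
    log_phi -= log_phi.max(axis=0)                # stabilize before exponentiating
    phi = np.exp(log_phi)
    phi /= phi.sum(axis=0)                        # normalize over topics
    # Eq. (4.14): gamma_{d,k} = alpha_k + sum_v n_{d,v} phi_{d,v,k}
    new_gamma = alpha + phi @ counts[ids]
    return new_gamma, phi
```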
Initialization and restarts. Since this algorithm finds a local maximum
of the variational objective function, initializing the topics is important. We
find that an effective initialization technique is to randomly choose a small
number (e.g., 1-5) of “seed” documents, create a distribution over words by
smoothing their aggregated word counts over the whole vocabulary, and from
these counts compute a first value for E[log β_{k,w}]. The inference algorithm
may be restarted multiple times, with different seed sets, to find a good local
maximum.
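One plausible rendering of this seeding strategy (the seed count, smoothing constant, and function names are illustrative assumptions):

```python
import numpy as np
from scipy.special import psi

def seed_initialize(doc_term_counts, K, n_seeds=3, smooth=1.0, seed=None):
    """Initialize each topic from a few randomly chosen seed documents.

    doc_term_counts: D x V matrix of raw term counts.
    """
    rng = np.random.default_rng(seed)
    D, V = doc_term_counts.shape
    lam = np.empty((K, V))
    for k in range(K):
        seeds = rng.choice(D, size=n_seeds, replace=False)
        # aggregate the seed documents' counts, smoothed over the vocabulary
        lam[k] = doc_term_counts[seeds].sum(axis=0) + smooth
    # first value of E[log beta_{k,w}] from the smoothed counts
    Elog_beta = psi(lam) - psi(lam.sum(axis=1, keepdims=True))
    return lam, Elog_beta
```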
Choosing the vocabulary. It is often computationally expensive to use
the entire vocabulary. Choosing the top V words by TFIDF is an effective
way to prune the vocabulary. This naturally prunes out stop words and other
terms that provide little thematic content to the documents. In the Science
analysis above we chose the top 10,000 terms this way.
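A minimal sketch of this pruning, assuming a dense document-term count matrix and one common TF-IDF variant (the text does not pin down the exact weighting):

```python
import numpy as np

def top_v_by_tfidf(doc_term_counts, v_keep):
    """Return indices of the v_keep terms with the highest corpus TF-IDF."""
    D = doc_term_counts.shape[0]
    df = np.count_nonzero(doc_term_counts, axis=0)   # document frequency
    idf = np.log(D / np.maximum(df, 1))              # stop words score near zero
    score = doc_term_counts.sum(axis=0) * idf        # total tf times idf
    return np.sort(np.argsort(score)[::-1][:v_keep])
```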
Choosing the number of topics. Choosing the number of topics is a
persistent problem in topic modeling and other latent variable analysis. In
some cases, the number of topics is part of the problem formulation and
specified by an outside source. In other cases, a natural approach is to use
cross-validation.
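A hedged sketch of that selection loop, where fit_lda and heldout_score are hypothetical stand-ins for the inference algorithm above and some held-out measure such as per-word predictive log likelihood:

```python
def choose_num_topics(train_docs, heldout_docs, candidate_Ks):
    # fit_lda and heldout_score are hypothetical stand-ins for the
    # inference algorithm above and a held-out evaluation measure.
    best_K, best_score = None, float("-inf")
    for K in candidate_Ks:
        model = fit_lda(train_docs, K)
        score = heldout_score(model, heldout_docs)
        if score > best_score:
            best_K, best_score = K, score
    return best_K
```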
 