Topic Models - Text Mining: Classification, Clustering, and Applications


the collection that the hidden structure of the DTM gives.

At the topic level, each topic is now a sequence of distributions over terms.

Thus, for each topic and year, we can score the terms with Eq. (4.3) and

visualize the topic as a whole with its top words over time. This gives a

global sense of how the important words of a topic have changed through the

span of the collection. For individual terms of interest, we can examine their

score over time within each topic. We can also examine the overall popularity

of each topic from year to year by computing the expected number of words

that were assigned to it.
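
The two summaries described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions. It assumes Eq. (4.3) is the term score beta[k][v] * log(beta[k][v] / geometric mean of beta[j][v] over topics j), applied separately within each year. It also assumes the fitted model provides per-year topic-term probabilities and per-document topic proportions. The function names and the nested-list data layout are hypothetical and do not come from any particular DTM implementation.

```python
import math

def term_scores(beta_t):
    """Score every term of every topic within one time slice.

    beta_t[k][v] is the probability of term v under topic k in that slice.
    Assumed form of Eq. (4.3): beta * log(beta / geometric mean over topics).
    All probabilities must be strictly positive.
    """
    K = len(beta_t)
    V = len(beta_t[0])
    scores = [[0.0] * V for _ in range(K)]
    for v in range(V):
        # Log of the geometric mean of term v's probability across the K topics.
        log_gm = sum(math.log(beta_t[k][v]) for k in range(K)) / K
        for k in range(K):
            b = beta_t[k][v]
            scores[k][v] = b * (math.log(b) - log_gm)
    return scores

def top_terms_over_time(beta, vocab, k, n=3):
    """Top-n terms of topic k in each slice, where beta[t][k][v] is indexed by slice t."""
    result = []
    for beta_t in beta:
        s = term_scores(beta_t)[k]
        order = sorted(range(len(vocab)), key=lambda v: s[v], reverse=True)
        result.append([vocab[v] for v in order[:n]])
    return result

def topic_popularity(theta, doc_lengths):
    """Expected number of words assigned to each topic within one slice.

    theta[d][k] is the proportion of topic k in document d;
    doc_lengths[d] is the number of words in document d.
    """
    K = len(theta[0])
    return [sum(n * th[k] for th, n in zip(theta, doc_lengths)) for k in range(K)]
```

Calling top_terms_over_time once per topic gives the word lists for a display of top words over time, and calling topic_popularity once per year gives the popularity curve. Weighting each document's proportions by its length is one reading of "expected number of words"; other normalizations are possible.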

As an example, we used the DTM model to analyze the entire archive of

Science from 1880-2002. This corpus comprises 140,000 documents. We used

a vocabulary of 28,637 terms chosen by taking the union of the top 1000

terms by TFIDF for each year. Figure 4.9 illustrates the top words of two of

the topics taken every ten years, the scores of several of the most prevalent

words taken every year, the relative popularity of the two topics, and selected

articles that contain that topic. For sequential corpora such as Science, the

DTM provides much richer exploratory tools than LDA or the CTM.
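
The vocabulary construction described above (the union of each year's top terms by TFIDF) might look like the following sketch. The text does not say which TFIDF variant was used. The code assumes a term's score in a year is its total count in that year times log(number of documents / document frequency), both computed within the year. The function names and the input layout (a dict mapping each year to a list of tokenized documents) are hypothetical.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, n):
    """Return the n highest-scoring terms among one year's tokenized documents.

    Assumed scoring: (total count in the year) * log(num_docs / document frequency).
    """
    tf = Counter()  # total occurrences of each term in the year
    df = Counter()  # number of documents containing each term
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    num_docs = len(docs)
    scores = {w: tf[w] * math.log(num_docs / df[w]) for w in tf}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]

def build_vocabulary(docs_by_year, n_per_year=1000):
    """Union of the per-year top terms, returned as a sorted list."""
    vocab = set()
    for year in sorted(docs_by_year):
        vocab.update(top_tfidf_terms(docs_by_year[year], n_per_year))
    return sorted(vocab)
```

Because many terms rank highly in more than one year, the union is much smaller than the number of years times 1000. That is consistent with the 28,637 terms reported for the Science archive.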

Finally, we note that the document similarity metric in Eq. (4.4) has interesting

properties in the context of the DTM. The metric is defined in terms

of the topic proportions for each document. For two documents in different

years, these proportions refer to two different slices of the K topics, but the

two sets of topics are linked together by the sequential model. Consequently,

the metric provides a time-corrected notion of document similarity. Two articles

about biology might be deemed similar even if one uses the vocabulary

of 1910 and the other of 2002.
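
To make the time-corrected comparison concrete, here is a minimal sketch that ranks documents by their distance to a query in topic-proportion space. It assumes Eq. (4.4) is the Hellinger-style distance, that is, the sum over topics of the squared difference between the square roots of the two documents' topic proportions. The function names, the document identifiers, and the dict layout are illustrative only.

```python
import math

def topic_distance(theta_i, theta_j):
    """Assumed form of Eq. (4.4): sum_k (sqrt(theta_i[k]) - sqrt(theta_j[k]))**2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(theta_i, theta_j))

def most_similar(query_id, theta, n=10):
    """Return the ids of the n documents closest to the query.

    theta maps a document id to its list of K topic proportions. The year of
    each document plays no role here: the proportions are over the same K
    topics in every year, which is what links articles across eras.
    """
    q = theta[query_id]
    ranked = sorted(
        (topic_distance(q, th), doc_id)
        for doc_id, th in theta.items()
        if doc_id != query_id
    )
    return [doc_id for _, doc_id in ranked[:n]]
```

Note that the distance never looks at the words themselves. Two documents with the same mix of topics are close even when each topic's word distribution has drifted between their publication years.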

Figure 4.10 illustrates the top ten most similar articles to the 1994 Science

article “Automatic Analysis, Theme Generation, and Summarization of

Machine-Readable Texts.” This article is about ways of summarizing and

organizing large archives to manage the modern information explosion. As

expected, among the top ten most similar documents are articles from the

same era about many of the same topics. Other articles, however, come from

much earlier decades. “Simple and Rapid Method for the Coding of Punched

Cards” (1962) is about organizing document information on punch cards. It

uses different language from the query article, but is arguably similar in that

it is about storing and organizing documents with the precursor to modern

computers. Even more striking among the top ten is “The Storing of Pamphlets”

(1899). This article addresses the information explosion problem, now

considered quaint, at the turn of the century.
