the collection that the hidden structure of the DTM gives.
At the topic level, each topic is now a sequence of distributions over terms.
Thus, for each topic and year, we can score the terms with Eq. (4.3) and
visualize the topic as a whole with its top words over time. This gives a
global sense of how the important words of a topic have changed through the
span of the collection. For individual terms of interest, we can examine their
score over time within each topic. We can also examine the overall popularity
of each topic from year to year by computing the expected number of words
that were assigned to it.
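The per-year scoring and popularity computations described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes Eq. (4.3) is the TFIDF-like term score that multiplies a term's probability in a topic by the log of its ratio to the geometric mean of that term's probability across all K topics. It also approximates the expected number of words assigned to a topic by weighting each document's estimated topic proportions by its length. The names `term_scores`, `top_terms_by_year`, and `topic_popularity`, and the nested-list layouts of `beta` and `theta`, are hypothetical.

```python
import math

def term_scores(beta):
    """Score every term in every topic for ONE time slice.

    beta[k][v] is the probability of term v under topic k; all entries
    must be strictly positive so the logarithms are defined.
    Assumed form of Eq. (4.3): beta[k][v] * log(beta[k][v] / geometric
    mean of term v's probability across the K topics).
    """
    num_topics = len(beta)
    num_terms = len(beta[0])
    # Log of the geometric mean of each term's probability across topics.
    log_geo_mean = [
        sum(math.log(beta[k][v]) for k in range(num_topics)) / num_topics
        for v in range(num_terms)
    ]
    return [
        [beta[k][v] * (math.log(beta[k][v]) - log_geo_mean[v])
         for v in range(num_terms)]
        for k in range(num_topics)
    ]

def top_terms_by_year(beta_by_year, vocab, topic, n=10):
    """Map each year to the n highest-scoring terms of one topic."""
    result = {}
    for year, beta in sorted(beta_by_year.items()):
        scores = term_scores(beta)[topic]
        ranked = sorted(range(len(vocab)), key=lambda v: scores[v], reverse=True)
        result[year] = [vocab[v] for v in ranked[:n]]
    return result

def topic_popularity(theta, doc_lengths):
    """Approximate expected number of words per topic within one year.

    theta[d][k] is the estimated proportion of topic k in document d and
    doc_lengths[d] is that document's word count. Using document-level
    proportions instead of per-word assignments is an assumption.
    """
    num_topics = len(theta[0])
    return [
        sum(n_d * theta_d[k] for theta_d, n_d in zip(theta, doc_lengths))
        for k in range(num_topics)
    ]
```

Calling `top_terms_by_year` once per topic gives the "top words over time" view, and calling `topic_popularity` on each year's documents gives the year-to-year popularity curve.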
As an example, we used the DTM to analyze the entire archive of
Science from 1880-2002. This corpus comprises 140,000 documents. We used
a vocabulary of 28,637 terms chosen by taking the union of the top 1000
terms by TFIDF for each year. Figure 4.9 illustrates the top words of two of
the topics taken every ten years, the scores of several of the most prevalent
words taken every year, the relative popularity of the two topics, and selected
articles that contain that topic. For sequential corpora such as Science, the
DTM provides much richer exploratory tools than LDA or the CTM.
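The vocabulary selection mentioned above (the union of the top terms by TFIDF for each year) might look like the sketch below. The text does not say how TFIDF is aggregated over a year, so this version makes an assumption: a term's yearly score is the sum, over that year's documents, of its term frequency times log(D / df), with document frequency df counted over the whole corpus of D documents. The function name `build_vocabulary` and the `(year, tokens)` input layout are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_vocabulary(docs, top_n=1000):
    """Return the sorted union of each year's top_n terms by TFIDF.

    docs is an iterable of (year, tokens) pairs, where tokens is a list
    of already-tokenized terms. The aggregation scheme is an assumption.
    """
    docs = list(docs)
    num_docs = len(docs)

    # Document frequency of each term over the whole corpus.
    doc_freq = Counter()
    for _, tokens in docs:
        doc_freq.update(set(tokens))

    # Accumulate tf * idf per term within each year.
    year_scores = defaultdict(Counter)
    for year, tokens in docs:
        for term, tf in Counter(tokens).items():
            year_scores[year][term] += tf * math.log(num_docs / doc_freq[term])

    # Union of the top_n terms of every year.
    vocab = set()
    for scores in year_scores.values():
        vocab.update(term for term, _ in scores.most_common(top_n))
    return sorted(vocab)
```

Taking the union per year, rather than a single global top list, keeps terms that matter in only one era, which is why the resulting vocabulary is much larger than 1000 terms.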
Finally, we note that the document similarity metric in Eq. (4.4) has inter-
esting properties in the context of the DTM. The metric is defined in terms
of the topic proportions for each document. For two documents in different
years, these proportions refer to two different slices of the K topics, but the
two sets of topics are linked together by the sequential model. Consequently,
the metric provides a time-corrected notion of document similarity. Two
articles about biology might be deemed similar even if one uses the vocabulary
of 1910 and the other of 2002.
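To make the time-corrected comparison concrete, here is a minimal sketch. It assumes Eq. (4.4) is the squared difference of square-rooted topic proportions summed over the K topics (a Hellinger-style distance, where smaller values mean more similar documents). Because topic k in one year is linked to topic k in every other year, the proportions can be compared index by index regardless of publication year. The names `topic_distance` and `most_similar`, and the dictionary layout of `thetas`, are hypothetical.

```python
import math

def topic_distance(theta_a, theta_b):
    """Assumed form of Eq. (4.4): sum_k (sqrt(a_k) - sqrt(b_k))^2.

    theta_a and theta_b are topic-proportion vectors of equal length K.
    The result is 0 for identical proportions and 2 for disjoint ones.
    """
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2
               for a, b in zip(theta_a, theta_b))

def most_similar(query_id, thetas, n=10):
    """Return the ids of the n documents closest to the query document.

    thetas maps a document id to its topic proportions; the documents
    may come from different years.
    """
    query = thetas[query_id]
    ranked = sorted(
        (topic_distance(query, theta), doc_id)
        for doc_id, theta in thetas.items()
        if doc_id != query_id
    )
    return [doc_id for _, doc_id in ranked[:n]]
```

Only the topic proportions enter the distance, never the words themselves, so two documents with different vocabularies can still rank as close neighbors.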
Figure 4.10 illustrates the top ten most similar articles to the 1994 Sci-
ence article “Automatic Analysis, Theme Generation, and Summarization of
Machine-Readable Texts.” This article is about ways of summarizing and
organizing large archives to manage the modern information explosion. As
expected, among the top ten most similar documents are articles from the
same era about many of the same topics. Other articles, however, such as
“Simple and Rapid Method for the Coding of Punched Cards” (1962), are
also about organizing document information, here on punch cards. That article
uses different language from the query article, but is arguably similar in
that it is about storing and organizing documents with the precursor to
modern computers. Even more striking among the top ten is “The Storing of Pamphlets”
(1899). This article addresses the information explosion problem—now con-
sidered quaint—at the turn of the century.