9.6 Categorizing Documents by Topics
With the reviews collected and represented, the data science team at ACME wants to
categorize the reviews by topics. As discussed earlier in the chapter, a topic consists
of a cluster of words that frequently occur together and share the same theme.
The topics of a document are not as straightforward as they might initially appear.
Consider these two reviews:
1. The bPhone5x has coverage everywhere. It's much less flaky than my old
bPhone4G.
2. While I love ACME's bPhone series, I've been quite disappointed by the
bEbook. The text is illegible, and it makes even my old NBook look
blazingly fast.
Is the first review about bPhone5x or bPhone4G? Is the second review about
bPhone, bEbook, or NBook? For machines, these questions can be difficult to
answer.
Intuitively, if a review is talking about bPhone5x, the term bPhone5x and related
terms (such as phone and ACME) are likely to appear frequently. A document
typically consists of multiple themes running through the text in different
proportions: for example, 30% on a topic related to phones, 15% on a topic related
to appearance, 10% on a topic related to shipping, 5% on a topic related to
service, and so on.
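The idea of a document as a mixture of topics can be illustrated with a minimal sketch. It assumes hand-picked topic vocabularies (the TOPIC_TERMS dictionary and the second review are hypothetical and not from the ACME data) and simply reports what share of the topic-word hits in a review falls under each topic. A real topic model learns these vocabularies from the corpus rather than taking them as given.

```python
import re
from collections import Counter

# Hypothetical, hand-picked topic vocabularies (illustration only).
TOPIC_TERMS = {
    "phones": {"bphone5x", "bphone4g", "phone", "coverage", "acme"},
    "shipping": {"shipping", "delivery", "arrived", "package"},
    "service": {"service", "support", "refund"},
}

def topic_proportions(text):
    """Return the share of topic-word hits that falls under each topic."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    hits = Counter()
    for token in tokens:
        for topic, terms in TOPIC_TERMS.items():
            if token in terms:
                hits[topic] += 1
    total = sum(hits.values())
    if total == 0:
        # No topic words found: report zero for every topic.
        return {topic: 0.0 for topic in TOPIC_TERMS}
    return {topic: hits[topic] / total for topic in TOPIC_TERMS}

# First review from the text: every topic word belongs to "phones".
review_1 = ("The bPhone5x has coverage everywhere. "
            "It's much less flaky than my old bPhone4G.")
props_1 = topic_proportions(review_1)

# Hypothetical review that mixes several themes.
review_2 = "The phone arrived late and shipping support was slow."
props_2 = topic_proportions(review_2)
# props_2 -> {"phones": 0.25, "shipping": 0.5, "service": 0.25}
```

Word counting against a fixed vocabulary is only a stand-in for intuition; it cannot resolve the ambiguity raised above, such as whether the first review is about bPhone5x or bPhone4G.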
Document grouping can be achieved with clustering methods such as k-means
clustering [24] or classification methods such as support vector machines [25],
k-nearest neighbors [26], or naïve Bayes [27]. However, a more feasible and
prevalent approach is to use topic modeling. Topic modeling provides tools to
automatically organize, search, understand, and summarize vast amounts of
information. Topic models [28, 29] are statistical models that examine words
from a set of documents, determine the themes running through the text, and
discover how the themes are associated or change over time. The process of topic
modeling can be simplified to the following steps:
1. Uncover the hidden topical patterns within a corpus.
2. Annotate documents according to these topics.
3. Use annotations to organize, search, and summarize texts.
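The three steps above can be sketched end to end. The text does not name a specific algorithm at this point, so the example below assumes one: a tiny collapsed Gibbs sampler for an LDA-style topic model, written in plain Python. The function names, the hyperparameters (alpha, beta, iters), and the four toy documents are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def fit_topics(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Step 1: uncover topics with a small collapsed Gibbs sampler (sketch)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                        # total words assigned to topic k
    assign = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def annotate(doc_topic):
    """Step 2: turn per-document topic counts into topic proportions."""
    return [[c / sum(counts) for c in counts] for counts in doc_topic]

def organize(proportions):
    """Step 3: group document indices by their dominant topic."""
    groups = defaultdict(list)
    for d, props in enumerate(proportions):
        groups[props.index(max(props))].append(d)
    return dict(groups)

# Hypothetical, already-tokenized toy corpus.
docs = [
    "phone coverage phone signal".split(),
    "phone signal coverage battery".split(),
    "ebook text screen text".split(),
    "ebook screen slow text".split(),
]
doc_topic, topic_word = fit_topics(docs)
proportions = annotate(doc_topic)
groups = organize(proportions)
```

Because the sampler is randomized, the topic numbering and the exact proportions depend on the seed. In practice a tested library implementation would likely be preferred over a hand-rolled sampler like this one.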