What's Beyond Bag-of-Words?
Bag-of-words is a common starting point, but a Data Science team sometimes
prefers more sophisticated methods of text representation. These more advanced
methods consider factors such as word order, context, inferences, and discourse.
For example, one such method keeps track of the word order of every document
and compares the normalized differences of the word orders [14]. These advanced
techniques are outside the scope of this topic.
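To see what bag-of-words discards, the minimal sketch below compares two sentences that contain the same words in a different order. The helper names (tokenize, bag_of_words, bigrams) and the sample sentences are illustrative assumptions. Counting bigrams is just one simple way to retain some local word order; it is not the method cited in [14].

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(text):
    """Count each token, ignoring where it appears in the text."""
    return Counter(tokenize(text))

def bigrams(text):
    """Count adjacent token pairs, which preserves local word order."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical example documents with identical words but opposite meanings.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
same_bigrams = bigrams(doc_a) == bigrams(doc_b)
print(same_bag, same_bigrams)
```

The two documents are indistinguishable as bags of words, yet their bigram counts differ, which is the kind of information an order-aware representation keeps.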
In addition to the terms themselves, their morphological features may need to be
extracted. The morphological features specify additional information about the
terms, which may include root words, affixes, part-of-speech tags, named entities,
or intonation (variations of spoken pitch). The features from this step feed
downstream analyses such as classification or sentiment analysis.
The set of features that need to be extracted and stored depends heavily on the
specific task to be performed. If the task is to label and distinguish the parts of
speech, for example, the features will include all the words in the text and their
corresponding part-of-speech tags. If the task is to annotate named entities such as
names and organizations, the features highlight such information appearing in the
text. Constructing the features is no trivial task; quite often it is done entirely
manually, and sometimes it requires domain expertise.
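As a rough illustration of task-specific features, the sketch below builds one feature dictionary per token for a part-of-speech labeling task. Everything in it is an assumption made for illustration: the function name token_features, the particular features chosen (a three-character suffix as a crude stand-in for an affix, capitalization, neighboring words), and the hand-annotated sample sentence.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),  # crude stand-in for an affix feature
        "is_capitalized": word[:1].isupper(),
        "is_first": i == 0,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

# Hypothetical hand-annotated sentence: (token, part-of-speech tag) pairs.
tagged = [("Alice", "NOUN"), ("joined", "VERB"), ("Acme", "NOUN"), ("yesterday", "ADV")]
tokens = [tok for tok, _ in tagged]

# One (features, label) pair per token, the shape a classifier could consume.
training_rows = [(token_features(tokens, i), tag) for i, (_, tag) in enumerate(tagged)]
for features, tag in training_rows:
    print(tag, features)
```

Even in this toy form, the choice of features is manual: someone had to decide that suffixes and capitalization are worth recording, which is where domain expertise enters.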
Sometimes creating features is a text analysis task in itself. One such example is
topic modeling. Topic modeling provides a way to quickly analyze large volumes
of raw text and identify the latent topics. Topic modeling does not necessarily
require the documents to be labeled or annotated. It can discover topics directly
from an analysis of the raw text. A topic consists of a cluster of words that
frequently occur together and that share the same theme. Probabilistic topic
modeling, discussed in greater detail later in Section 9.6, is a suite of algorithms
that aim to parse large archives of documents and discover and annotate the topics.
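The idea that a topic is a cluster of words that frequently occur together can be sketched without any labels. The example below is not a probabilistic topic model; it only counts how many documents contain each pair of words, under the assumption of a tiny made-up corpus, a hand-picked stopword set, and a hypothetical helper named cooccurrence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands in for one unlabeled document.
docs = [
    "stock market prices fell",
    "market prices rose as stock demand grew",
    "the team won the match",
    "the team lost the final match",
]
STOPWORDS = {"the", "as"}  # assumed stopword list, for illustration only

def cooccurrence(documents):
    """Count, for each word pair, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc.lower().split()) - STOPWORDS)
        counts.update(combinations(vocab, 2))
    return counts

pairs = cooccurrence(docs)
# Keep only pairs that appear together in at least two documents.
frequent = sorted(pair for pair, n in pairs.items() if n >= 2)
print(frequent)
```

On this toy corpus the surviving pairs fall into two groups, one about markets and one about sports, even though no document was annotated. Probabilistic topic models pursue the same goal with a statistical model rather than raw pair counts.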
It is important not only to create a representation of a document but also to create
a representation of a corpus. As introduced earlier in the chapter, a corpus is
a collection of documents. A corpus could be so large that it includes all the
documents in one or more languages, or it could be smaller or limited to a specific
domain, such as technology, medicine, or law. For a web search engine, the entire