Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

What's Beyond Bag-of-Words?

Bag-of-words is a common starting point, but sometimes the Data Science team prefers more sophisticated methods of text representation. These more advanced methods consider factors such as word order, context, inferences, and discourse. For example, one such method keeps track of the word order of every document and compares the normalized differences of the word orders [14]. These advanced techniques are outside the scope of this topic.
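
To make the limitation concrete, the following minimal Python sketch shows two sentences that bag-of-words cannot tell apart, together with one possible way to compare word orders. The function names and the similarity formula (one minus the normalized difference of first-occurrence position vectors) are assumptions chosen for illustration; they are not necessarily the method described in [14].

```python
from collections import Counter
import math

def bag_of_words(text):
    """Count each lowercased, whitespace-separated term; word order is discarded."""
    return Counter(text.lower().split())

def word_order_vector(tokens, vocabulary):
    """1-based position of the first occurrence of each vocabulary word (0 if absent)."""
    positions = {}
    for index, token in enumerate(tokens, start=1):
        positions.setdefault(token, index)
    return [positions.get(word, 0) for word in vocabulary]

def word_order_similarity(text_a, text_b):
    """Illustrative score: 1 - ||ra - rb|| / ||ra + rb|| over the joint vocabulary."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    ra = word_order_vector(tokens_a, vocabulary)
    rb = word_order_vector(tokens_b, vocabulary)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(ra, rb)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(ra, rb)))
    return 1.0 if total == 0 else 1.0 - diff / total

if __name__ == "__main__":
    a, b = "dog bites man", "man bites dog"
    print(bag_of_words(a) == bag_of_words(b))  # same bag of words
    print(word_order_similarity(a, b))         # below 1.0: the orders differ
```

Under these assumptions, the two sentences have identical bag-of-words representations, while the word-order score falls below 1.0 because the positions of "dog" and "man" are swapped.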

Besides extracting the terms, their morphological features may need to be included. The morphological features specify additional information about the terms, which may include root words, affixes, part-of-speech tags, named entities, or intonation (variations of spoken pitch). The features from this step contribute to downstream analysis, such as classification or sentiment analysis.
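
As a rough illustration of what root words and affixes look like as features, the sketch below strips a handful of common English suffixes. The suffix list, the minimum root length, and the function names are assumptions made for this example; this is a deliberately naive approach, not an established stemming algorithm, and it mishandles many words (for example, "running" becomes "runn").

```python
# Hypothetical, deliberately small suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def split_affix(term):
    """Return (root, suffix) using naive suffix stripping; suffix is '' if none applies."""
    word = term.lower()
    for suffix in SUFFIXES:
        # Require at least three characters to remain so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)], suffix
    return word, ""

def morphological_features(text):
    """Attach a crude root and suffix to every whitespace-separated term."""
    features = []
    for term in text.split():
        root, suffix = split_affix(term)
        features.append({"term": term, "root": root, "suffix": suffix})
    return features

if __name__ == "__main__":
    for row in morphological_features("The cats jumped quickly"):
        print(row)
```

Each term is stored alongside its approximate root and suffix, so a downstream classifier could treat "jumped" and "jump" as related.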

The set of features that need to be extracted and stored depends heavily on the specific task to be performed. If the task is to label and distinguish the part of speech, for example, the features will include all the words in the text and their corresponding part-of-speech tags. If the task is to annotate named entities such as names and organizations, the features highlight such information appearing in the text. Constructing the features is no trivial task; quite often this is done entirely manually, and sometimes it requires domain expertise.
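
The two tasks above lead to differently shaped feature sets, which the following sketch tries to show. The tiny tag lexicon and entity list are hypothetical, hand-built resources invented for this example (echoing the point that features are often constructed manually); a real project would need far larger resources or a trained tagger.

```python
# Hypothetical hand-built resources, for illustration only.
POS_LEXICON = {"alice": "NOUN", "joined": "VERB", "acme": "NOUN", "the": "DET"}
ENTITY_LIST = {"alice": "PERSON", "acme": "ORGANIZATION"}

def pos_features(tokens):
    """Part-of-speech task: keep every word, paired with its tag ('UNK' if unlisted)."""
    return [(token, POS_LEXICON.get(token.lower(), "UNK")) for token in tokens]

def entity_features(tokens):
    """Named-entity task: keep only the tokens found in the entity list."""
    return [
        (token, ENTITY_LIST[token.lower()])
        for token in tokens
        if token.lower() in ENTITY_LIST
    ]

if __name__ == "__main__":
    tokens = "Alice joined Acme yesterday".split()
    print(pos_features(tokens))     # one (word, tag) pair per token
    print(entity_features(tokens))  # only the recognized entities
```

The part-of-speech features cover every token, whereas the entity features highlight only the names and organizations found in the text.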

Sometimes creating features is a text analysis task in itself. One such example is topic modeling. Topic modeling provides a way to quickly analyze large volumes of raw text and identify the latent topics. Topic modeling may not require the documents to be labeled or annotated. It can discover topics directly from an analysis of the raw text. A topic consists of a cluster of words that frequently occur together and that share the same theme. Probabilistic topic modeling, discussed in greater detail later in Section 9.6, is a suite of algorithms that aim to parse large archives of documents and discover and annotate the topics.
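
The idea that a topic is a cluster of words that frequently occur together can be illustrated by counting word pairs that appear in the same document. The sketch below is only that counting step, using made-up example documents; it is not a probabilistic topic model and should not be read as the algorithm of Section 9.6.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each unordered word pair, the number of documents containing both words."""
    counts = Counter()
    for document in documents:
        # Sort the unique terms so each pair is always stored in the same order.
        terms = sorted(set(document.lower().split()))
        counts.update(combinations(terms, 2))
    return counts

if __name__ == "__main__":
    # Made-up documents, for illustration only.
    docs = [
        "stock market trading",
        "market stock prices",
        "football match goal",
    ]
    counts = cooccurrence_counts(docs)
    print(counts[("market", "stock")])  # co-occur in two documents
    print(counts[("goal", "stock")])    # never co-occur
```

Pairs with high counts, such as "market" and "stock" here, hint at words that may belong to the same theme, and no labels or annotations were needed to find them.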

It is important not only to create a representation of a document but also to create a representation of a corpus. As introduced earlier in the chapter, a corpus is a collection of documents. A corpus could be so large that it includes all the documents in one or more languages, or it could be smaller or limited to a specific domain, such as technology, medicine, or law. For a web search engine, the entire
