Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics


Key Concepts

Term

Corpus

Text normalization

TFIDF

Topic modeling

Sentiment analysis

Text analysis, sometimes called text analytics, refers to the representation, processing, and modeling of textual data to derive useful insights. An important component of text analysis is text mining, the process of discovering relationships and interesting patterns in large text collections.

Text analysis suffers from the curse of high dimensionality. Take the popular children's book Green Eggs and Ham [1] as an example. Author Theodor Geisel (Dr. Seuss) was challenged to write an entire book with just 50 distinct words. He responded with Green Eggs and Ham, which contains 804 total words, only 50 of them distinct. These 50 words are:

a, am, and, anywhere, are, be, boat, box, car, could, dark, do, eat, eggs, fox, goat, good, green, ham, here, house, I, if, in, let, like, may, me, mouse, not, on, or, rain, Sam, say, see, so, thank, that, the, them, there, they, train, tree, try, will, with, would, you

There is a substantial amount of repetition in the book. Yet, as repetitive as the book is, modeling it as a vector of counts, or features, for each distinct word still results in a 50-dimensional problem.
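To make the vector-of-counts idea concrete, here is a minimal Python sketch. It is only an illustration, not a method taken from the book: the function name count_vector, the simple regular-expression tokenizer, and the short sample sentence (invented here from words in the 50-word list) are all assumptions.

```python
import re
from collections import Counter

# The 50 distinct words listed above, lowercased so matching is case-insensitive.
VOCABULARY = """a am and anywhere are be boat box car could dark do eat eggs fox
goat good green ham here house i if in let like may me mouse not
on or rain sam say see so thank that the them there they train tree
try will with would you""".split()


def count_vector(text, vocabulary=VOCABULARY):
    """Return one count per vocabulary word, in vocabulary order.

    Hypothetical helper: tokenization is a naive lowercase letter match,
    which is enough for this vocabulary but not for general text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Invented sample sentence built only from vocabulary words.
sample = "I like the green boat. You like the green car."
vector = count_vector(sample)

print(len(vector))  # 50
# Show only the nonzero dimensions of the vector.
print({w: c for w, c in zip(VOCABULARY, vector) if c})
# {'boat': 1, 'car': 1, 'green': 2, 'i': 1, 'like': 2, 'the': 2, 'you': 1}
```

Even this ten-word sample is represented by a 50-element vector in which most entries are zero, which hints at why dimensionality grows so quickly for richer texts.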

Green Eggs and Ham is a simple book. Text analysis often deals with textual data that is far more complex. A corpus (plural: corpora) is a large collection of texts used for various purposes in Natural Language Processing (NLP). Table 9.1 lists a few example corpora that are commonly used in NLP research.
