Key Concepts
Term
Corpus
Text normalization
TFIDF
Topic modeling
Sentiment analysis
Text analysis, sometimes called text analytics, refers to the representation,
processing, and modeling of textual data to derive useful insights. An important
component of text analysis is text mining, the process of discovering relationships
and interesting patterns in large text collections.
Text analysis suffers from the curse of high dimensionality. Take the popular
children's book Green Eggs and Ham [1] as an example. Author Theodor Geisel
(Dr. Seuss) was challenged to write an entire book with just 50 distinct words. He
responded with Green Eggs and Ham, which contains 804 total words,
only 50 of them distinct. These 50 words are:
a, am, and, anywhere, are, be, boat, box, car, could, dark, do, eat, eggs, fox,
goat, good, green, ham, here, house, I, if, in, let, like, may, me, mouse, not,
on, or, rain, Sam, say, see, so, thank, that, the, them, there, they, train, tree,
try, will, with, would, you
There's a substantial amount of repetition in the book. Yet, as repetitive as the book
is, modeling it as a vector of counts, or features, for each distinct word still results
in a 50-dimensional problem.
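
To make the dimensionality concrete, here is a minimal Python sketch of the count-vector representation described above. It is an illustration under stated assumptions, not a method taken from the text: the function name count_vector, the whitespace tokenization, the lowercasing, and the sample sentence are all choices made for this example.

```python
# The 50 distinct words listed above form the vocabulary.
VOCABULARY = """a am and anywhere are be boat box car could dark do eat eggs fox
goat good green ham here house I if in let like may me mouse not
on or rain Sam say see so thank that the them there they train tree
try will with would you""".split()


def count_vector(text, vocabulary=VOCABULARY):
    """Return a list with one count per vocabulary word (hypothetical helper).

    Assumptions: tokens are split on whitespace, surrounding punctuation is
    stripped, and matching is case-insensitive. Words outside the vocabulary
    are ignored.
    """
    index = {word.lower(): i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        token = token.strip(".,!?;:\"'")
        if token in index:
            vector[index[token]] += 1
    return vector


sample = "I do not like green eggs and ham."
vec = count_vector(sample)
print(len(vec))  # one dimension per distinct word: 50
print(sum(vec))  # number of in-vocabulary tokens in the sample: 8
```

Even a short sentence is represented as a 50-element vector, most of whose entries are zero. With a realistic vocabulary of tens of thousands of distinct words, the same representation grows accordingly.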
Green Eggs and Ham is a simple book. Text analysis often deals with textual data
that is far more complex. A corpus (plural: corpora) is a large collection of texts
used for various purposes in Natural Language Processing (NLP). Table 9.1 lists a
few example corpora that are commonly used in NLP research.