Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

9.1 Text Analysis Steps

A text analysis problem usually consists of three important steps: parsing, search and retrieval, and text mining. Note that a text analysis problem may also consist of other subtasks (such as discourse and segmentation) that are outside the scope of this topic.

Parsing is the process that takes unstructured text and imposes a structure for further analysis. The unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders it in a more structured way for the subsequent steps.
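As a minimal sketch of what parsing can look like, the following Python example strips the markup from a small HTML string and renders the remaining text as a list of lowercase tokens. The class name TextExtractor, the function parse_html, and the sample input are hypothetical and chosen only for illustration; a real parser would also need to handle encodings, scripts, and malformed markup.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only text nodes that contain something other than whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def parse_html(raw_html):
    """Return a structured record: the plain text plus its lowercase tokens."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(extractor.chunks)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {"text": text, "tokens": tokens}


sample = "<html><body><h1>Text Analysis</h1><p>Parsing imposes structure.</p></body></html>"
record = parse_html(sample)
print(record["tokens"])
# ['text', 'analysis', 'parsing', 'imposes', 'structure']
```

The resulting record is the kind of structured representation that the later steps can index and analyze.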

Search and retrieval is the identification of the documents in a corpus that contain search items such as specific words, phrases, topics, or entities like people or organizations. These search items are generally called key terms. Search and retrieval originated from the field of library science and is now used extensively by web search engines.
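One common way to support this step is an inverted index, which maps each term to the documents that contain it. The sketch below is illustrative only: the function names, the three-document corpus, and the rule that a document must contain every key term in the query are assumptions, not a description of how any particular search engine works.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every key term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document corpus.
corpus = {
    "d1": "The phone has a great camera",
    "d2": "The camera on this tablet is poor",
    "d3": "Great battery life on the phone",
}
index = build_inverted_index(corpus)
print(sorted(search(index, "great phone")))
# ['d1', 'd3']
```

Because the index is built once, each query only intersects a few small sets instead of rescanning every document.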

Text mining uses the terms and indexes produced by the prior two steps to discover meaningful insights pertaining to domains or problems of interest. With the proper representation of the text, many of the techniques mentioned in the previous chapters, such as clustering and classification, can be adapted to text mining. For example, the k-means algorithm from Chapter 4, “Advanced Analytical Theory and Methods: Clustering,” can be modified to cluster text documents into groups, where each group represents a collection of documents with a similar topic [2]. The distance of a document to a centroid represents how closely the document relates to that topic. Classification tasks such as sentiment analysis and spam filtering are prominent use cases for the naïve Bayes classifier (Chapter 7, “Advanced Analytical Theory and Methods: Classification”). Text mining may utilize methods and techniques from various fields of study, such as statistical analysis, information retrieval, data mining, and natural language processing.
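To illustrate how k-means might be adapted to text, the sketch below represents each document as a term-frequency vector and clusters the vectors with a plain k-means loop. It is a simplified illustration rather than the method described in [2]: the raw term-frequency representation, the Euclidean distance, the seeding of centroids with the first k documents, and the four sample documents are all assumptions made to keep the example short and reproducible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def to_vectors(docs):
    """Turn each document into a term-frequency vector over a shared vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocab] for c in counts]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical documents about two topics.
docs = [
    "stocks market trading stocks",
    "football match goal football",
    "market stocks investors",
    "goal match players",
]
vectors = to_vectors(docs)
labels, centroids = kmeans(vectors, k=2)
print(labels)
# [0, 1, 0, 1]
```

Here distance(vectors[i], centroids[labels[i]]) plays the role described above: the smaller the value, the more closely document i matches the topic of its group. In practice, weighted representations such as TF-IDF are often preferred over raw counts.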

Note that, in practice, not all three steps have to be present in a text analysis project. If the goal is to construct a corpus or provide a catalog service, for example, the focus would be the parsing task using one or more text preprocessing techniques, such as part-of-speech (POS) tagging, named entity recognition, lemmatization, or stemming. Furthermore, the three tasks do not have to be sequential. Sometimes their order might even resemble a tree. For example, one
