9.1 Text Analysis Steps
A text analysis problem usually consists of three important steps: parsing, search
and retrieval, and text mining. Note that a text analysis problem may also consist of
other subtasks (such as discourse and segmentation) that are outside the scope of
this topic.
Parsing is the process that takes unstructured text and imposes a structure for
further analysis. The unstructured text could be a plain text file, a weblog, an
Extensible Markup Language (XML) file, a HyperText Markup Language (HTML)
file, or a Word document. Parsing deconstructs the provided text and renders it in a
more structured way for the subsequent steps.
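As a minimal sketch of what parsing can look like, the snippet below takes an HTML string and renders it as a small structured record holding the plain text and a list of lowercase tokens. The names TextExtractor and parse_document, the record layout, and the sample string are assumptions made for illustration; they are not from the text. It relies only on Python's standard html.parser module and does not filter script or style content.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # HTMLParser calls this for each run of text between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def parse_document(raw_html):
    """Impose a simple structure on raw HTML: plain text plus tokens."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(extractor.chunks)
    # Assumption: a token is any run of ASCII letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {"text": text, "tokens": tokens}


sample = "<html><body><h1>Text Analysis</h1><p>Parsing imposes structure.</p></body></html>"
record = parse_document(sample)
print(record["tokens"])
```

The resulting record is the kind of structured representation that the later steps can consume. A real project would likely use a dedicated parser for each input format.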
Search and retrieval is the identification of the documents in a corpus that
contain search items such as specific words, phrases, topics, or entities like people
or organizations. These search items are generally called key terms. Search and
retrieval originated from the field of library science and is now used extensively by
web search engines.
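One common way to support search and retrieval is an inverted index, which maps each term to the documents that contain it. The sketch below is illustrative only: the function names, the three-document corpus, and the choice of boolean AND matching are assumptions, not something the text prescribes.

```python
from collections import defaultdict
import re


def tokenize(text):
    # Assumption: a term is any run of ASCII letters or digits, lowercased.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy corpus for illustration.
corpus = {
    "d1": "Library science shaped early retrieval systems.",
    "d2": "Web search engines index billions of pages.",
    "d3": "Search engines borrow ideas from library science.",
}
index = build_index(corpus)
print(sorted(search(index, "search engines")))  # ['d2', 'd3']
```

Production search engines add ranking, phrase matching, and entity handling on top of this basic lookup.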
Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest. With
the proper representation of the text, many of the techniques mentioned in the
previous chapters, such as clustering and classification, can be adapted to text
mining. For example, the k-means algorithm from Chapter 4, “Advanced Analytical Theory
and Methods: Clustering,” can be modified to cluster text documents into groups,
where each group represents a collection of documents with a similar topic [2]. The
distance from a document to a centroid indicates how closely the document relates
to that topic. Classification tasks such as sentiment analysis and spam filtering
are prominent use cases for the naïve Bayes classifier (Chapter 7, “Advanced
Analytical Theory and Methods: Classification”). Text mining may utilize methods
and techniques from various fields of study, such as statistical analysis, information
retrieval, data mining, and natural language processing.
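To make the clustering idea concrete, the following sketch represents each document as a term-frequency vector and runs a bare-bones k-means over those vectors. Everything here is an assumption chosen for illustration: the four toy documents, the raw term counts, the Euclidean distance, and the fixed initial centroids (the first and third documents). It is not the specific adaptation described in the cited reference.

```python
import math
import re
from collections import Counter


def tf_vector(text, vocabulary):
    """Represent a document as raw term counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(vector, centroids):
    """Index of the centroid closest to the given vector."""
    distances = [euclidean(vector, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(vectors, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        labels = [nearest(v, centroids) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy documents covering two topics.
docs = [
    "the team won the football match",
    "the football team lost the match",
    "the bank raised interest rates",
    "interest rates at the bank fell",
]
vocabulary = sorted({w for d in docs for w in re.findall(r"[a-z]+", d.lower())})
vectors = [tf_vector(d, vocabulary) for d in docs]
labels, centroids = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)  # [0, 0, 1, 1]
```

Calling euclidean(vectors[i], centroids[labels[i]]) then gives the distance of document i to the centroid of its group, the quantity the paragraph above interprets as closeness to a topic. In practice, weighted representations such as TF-IDF and other distance measures are often preferred over raw counts.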
Note that, in practice, not all three steps have to be present in a text analysis
project. If the goal is to construct a corpus or provide a catalog service, for example,
the focus would be the parsing task using one or more text preprocessing
techniques, such as part-of-speech (POS) tagging, named entity recognition,
lemmatization, or stemming. Furthermore, the three tasks do not have to be
sequential. Sometimes their ordering might even resemble a tree. For example, one