ally, this algorithm aims at minimizing an objective function; in this case, a squared
error function.
It is an iterative, nonhierarchical method for data classification.
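To make the idea of a squared error objective concrete, here is a minimal sketch. The excerpt does not name the algorithm, so the sketch assumes a centroid-based grouping in which every point is assigned to one center; the function name `squared_error` and the sample points are hypothetical and serve only as illustration.

```python
def squared_error(points, centers, assignment):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, k in zip(points, assignment):
        center = centers[k]
        # Add the squared distance between this point and its center.
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Hypothetical data: two obvious groups of two points each.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 9.0)]
centers = [(1.5, 1.0), (8.5, 9.0)]
assignment = [0, 0, 1, 1]  # index of the center each point belongs to

objective = squared_error(points, centers, assignment)
print(objective)
```

An iterative method of this kind would alternate between updating the assignment and updating the centers, stopping when the value of the objective no longer decreases.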
Text analysis
Text analysis is the processing and representation of textual data for the
purpose of analyzing it and learning new models from it.
The main challenge in text analysis is high dimensionality. When analyzing a
document, every distinct word that can occur in it represents a dimension.
The other major challenge is that the data is unstructured.
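The dimensionality problem is easy to see with a small bag-of-words sketch. The helper names `tokenize` and `bag_of_words` and the two sample sentences are assumptions made for illustration; they are not taken from the text.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    token_lists = [tokenize(doc) for doc in documents]
    # Every distinct word across the documents becomes one dimension.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical two-document corpus.
docs = ["The cat sat on the mat", "The dog chased the cat"]
vocabulary, vectors = bag_of_words(docs)
print(len(vocabulary), "dimensions:", vocabulary)
```

Even these two short sentences need seven dimensions. A realistic corpus can easily contain tens of thousands of distinct words, and most entries in each vector will be zero.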
The problem-solving process in text analysis consists of three important
steps: parsing, search/retrieval, and text mining.
Parsing is the step that takes an unstructured or semi-structured document
and imposes a structure on it for downstream analysis. Parsing essentially means
reading the text, which could be a weblog, an RSS feed, an XML or HTML file, or a
Word document. It decomposes what is read in and renders it in a structure for the
subsequent steps.
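As an illustration of the parsing step, the following sketch reads a small HTML fragment with Python's standard `html.parser` module and renders it as a structured record. The class `TextExtractor`, the function `parse_document`, the record layout, and the sample markup are all hypothetical choices; real documents would need more careful handling.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect the title and the paragraph texts of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current_tag = None  # "title" or "p" while inside one
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            # Join the collected pieces and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.paragraphs.append(text)
            self._current_tag = None


def parse_document(html):
    """Impose a simple structure: title, paragraphs, and tokens per paragraph."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = [re.findall(r"[a-z0-9]+", p.lower()) for p in parser.paragraphs]
    return {"title": parser.title, "paragraphs": parser.paragraphs, "tokens": tokens}


# Hypothetical input document.
sample = (
    "<html><head><title>Demo</title></head><body>"
    "<p>Text analysis is fun.</p><p>Parsing comes <b>first</b>.</p>"
    "</body></html>"
)
record = parse_document(sample)
print(record)
```

The resulting dictionary is the kind of structure that the later search and mining steps can work on without knowing anything about the original markup.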
Once parsing is done, the problem shifts to the search for and/or retrieval of
specific words or phrases, or to finding a specific topic or entity (a person or a
corporation) in a document or a corpus (a body of knowledge). All text representation
takes place implicitly in the context of the corpus. Search and retrieval is something
we are used to performing with search engines such as Google. Most of the techniques
used in search and retrieval originated in the field of library science.
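A simple way to picture search and retrieval over a corpus is an inverted index, which maps each word to the documents that contain it. The sketch below is one minimal way to do this and is not taken from the text; the function names, document ids, and sample corpus are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical corpus of three short documents.
corpus = {
    "doc1": "Acme Corporation reported record earnings.",
    "doc2": "The library added a new search catalog.",
    "doc3": "Acme opened a search and retrieval lab.",
}
index = build_inverted_index(corpus)
matches = search(index, "acme search")
print(matches)
```

The index is built once for the whole corpus, so each query only has to intersect a few sets instead of rereading every document.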
With these two steps complete, the output is a structured set of tokens, or a
set of keywords that were searched, retrieved, and organized. The third task is
mining the text, that is, understanding the content itself. Instead of treating the
text as a set of tokens or keywords, in this step we derive meaningful insights from
the data that pertain to the domain of knowledge, business process, or problem
that we are trying to solve.