Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

extracted from the proceedings of the European Parliament in 21 European

languages.

Most corpora come with metadata, such as the size of the corpus and the domains

from which the text is extracted. Some corpora (such as the Brown Corpus) include

the information content of every word appearing in the text. Information

content (IC) is a metric that denotes the importance of a term in a corpus. The

conventional way [19] of measuring the IC of a term is to combine the knowledge of

its hierarchical structure from an ontology with statistics on its actual usage in text

derived from a corpus. Terms with higher IC values are considered more important

than terms with lower IC values. For example, the word necklace generally has

a higher IC value than the word jewelry in an English corpus because jewelry is

more general and is likely to appear more often than necklace. Research shows

that IC can help measure the semantic similarity of terms [20]. In addition, such

measures do not require an annotated corpus, and they generally achieve strong

correlations with human judgment [21, 20].
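
The necklace and jewelry comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the exact method of [19]: it assumes IC is defined as the negative log of a term's probability, where every occurrence of a term also counts toward its ancestors in the ontology. The toy hierarchy PARENT, the token list, and the function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy ontology: child -> parent (None marks the root).
PARENT = {
    "jewelry": None,
    "necklace": "jewelry",
    "ring": "jewelry",
}

def concept_frequencies(tokens, parent):
    """Count each known term, crediting every occurrence to all its ancestors."""
    freq = Counter()
    for tok in tokens:
        node = tok if tok in parent else None  # ignore terms outside the ontology
        while node is not None:
            freq[node] += 1
            node = parent[node]
    return freq

def information_content(tokens, parent):
    """Return IC = log(total / freq) for every term observed in the tokens."""
    freq = concept_frequencies(tokens, parent)
    total = sum(1 for tok in tokens if tok in parent)
    if total == 0:
        return {}
    return {term: math.log(total / count) for term, count in freq.items()}

# Made-up corpus text, used only for illustration.
tokens = "jewelry necklace ring ring jewelry necklace jewelry ring".split()
ic = information_content(tokens, PARENT)
```

Under these assumptions the general term jewelry absorbs the counts of its descendants, so its IC is lower than that of necklace, which matches the intuition described in the text.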

In the brand management example, the team has collected the ACME product

reviews and turned them into the proper representation with the techniques

discussed earlier. Next, the reviews and the representation need to be stored in a

searchable archive for future reference and research. This archive could be a SQL

database, XML or JSON files, or plain text files in one or more directories.
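
As one possible shape for such an archive, the sketch below stores each review alongside a bag-of-words representation in a SQL database, using SQLite from the Python standard library. The table layout, the function names, and the whitespace tokenizer are assumptions made for illustration; the source does not prescribe a schema or a particular representation.

```python
import json
import sqlite3
from collections import Counter

def open_archive(path=":memory:"):
    """Open (or create) the archive; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "id INTEGER PRIMARY KEY, "
        "raw_text TEXT NOT NULL, "
        "term_counts TEXT NOT NULL)"
    )
    return conn

def add_review(conn, raw_text):
    """Store a review with its term counts serialized as JSON."""
    counts = Counter(raw_text.lower().split())  # naive whitespace tokenizer
    cur = conn.execute(
        "INSERT INTO reviews (raw_text, term_counts) VALUES (?, ?)",
        (raw_text, json.dumps(counts)),
    )
    conn.commit()
    return cur.lastrowid

def search(conn, term):
    """Return (id, raw_text) for every review whose term counts contain term."""
    rows = conn.execute(
        "SELECT id, raw_text, term_counts FROM reviews ORDER BY id"
    )
    return [(rid, text) for rid, text, tc in rows if term.lower() in json.loads(tc)]

# Made-up reviews, used only for illustration.
conn = open_archive()
first_id = add_review(conn, "The ACME phone is great")
second_id = add_review(conn, "Battery died fast")
matches = search(conn, "acme")
```

Keeping the raw text next to its representation means the representation can be rebuilt later if the tokenization or weighting scheme changes.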

Corpus statistics such as IC can help identify the importance of a term from

the documents being analyzed. However, IC values included in the metadata of

a traditional corpus (such as the Brown Corpus) sitting externally as a knowledge

base cannot satisfy the need to analyze dynamically changing, unstructured

data from the web. The problem is twofold. First, neither traditional corpora nor

their IC metadata change over time. Any term that does not exist in the corpus

text, including any newly invented word, automatically receives a zero IC value. Second,

the corpus represents the entire knowledge base for the algorithm being used in

the downstream analysis. Because the text is unstructured, the data being

analyzed can cover any topic, and many topics may be absent from the given

knowledge base. For example, if the task is to research people's attitudes toward

musicians, a traditional corpus constructed 50 years ago would not know that the

term U2 is a band; therefore, the term would receive an IC of zero, marking it as

unimportant. A better approach would go through all the fetched documents

and find out that most of them are related to music, with U2 appearing too often

to be an unimportant term. Therefore, it is necessary to come up with a metric

that can easily adapt to the context and nature of the text instead of relying on a
