extracted from the proceedings of the European Parliament in 21 European
languages.
Most corpora come with metadata, such as the size of the corpus and the domains
from which the text is extracted. Some corpora (such as the Brown Corpus) include
the information content of every word appearing in the text. Information
content (IC) is a metric that denotes the importance of a term in a corpus. The
conventional way [19] of measuring the IC of a term is to combine the knowledge of
its hierarchical structure from an ontology with statistics on its actual usage in text
derived from a corpus. Terms with higher IC values are considered more important
than terms with lower IC values. For example, the word necklace generally has
a higher IC value than the word jewelry in an English corpus because jewelry is
more general and is likely to appear more often than necklace. Research shows
that IC can help measure the semantic similarity of terms [20]. In addition, IC-based
similarity measures do not require an annotated corpus, and they generally achieve
strong correlations with human judgment [20, 21].
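The idea can be made concrete with a minimal sketch. It assumes the common formulation IC(t) = -log p(t), where p(t) is the relative frequency of a term and of everything beneath it in the hierarchy. The PARENT mapping and the sample sentence are made up for illustration; a real system would take the hierarchy from an ontology and the counts from a full corpus.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (child -> parent), standing in for a real ontology.
PARENT = {"necklace": "jewelry", "ring": "jewelry", "jewelry": "accessory"}


def information_content(tokens, parent=PARENT):
    """Return {term: IC}, assuming IC(t) = -log p(t).

    Each occurrence of a term is also counted for all of its ancestors,
    so a general term can never be rarer than its more specific children.
    """
    counts = Counter()
    for tok in tokens:
        node = tok
        while node is not None:
            counts[node] += 1
            node = parent.get(node)  # None once the top of the hierarchy is reached
    total = len(tokens)
    return {term: -math.log(n / total) for term, n in counts.items()}


# Made-up sample text, already lowercased and tokenized by whitespace.
tokens = "the jewelry store sells a necklace and a ring and more jewelry".split()
ic = information_content(tokens)
```

In this toy sample, ic["necklace"] comes out higher than ic["jewelry"], matching the intuition that the more general word is the less informative one. A term that never occurs in the text is simply absent from the result.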
In the brand management example, the team has collected the ACME product
reviews and turned them into the proper representation with the techniques
discussed earlier. Next, the reviews and the representation need to be stored in a
searchable archive for future reference and research. This archive could be a SQL
database, XML or JSON files, or plain text files in one or more directories.
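As one possible illustration of the SQL option, the sketch below stores each review alongside its representation in a SQLite database using Python's standard sqlite3 module. The table layout, the choice of term counts serialized as JSON for the representation, and the sample reviews are all assumptions made for this example, not details of the team's actual archive.

```python
import json
import sqlite3


def build_archive(reviews, path=":memory:"):
    """Create the (hypothetical) reviews table and load the reviews into it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "id INTEGER PRIMARY KEY, "
        "raw_text TEXT NOT NULL, "
        "term_counts TEXT NOT NULL)"  # representation stored as a JSON string
    )
    conn.executemany(
        "INSERT INTO reviews (raw_text, term_counts) VALUES (?, ?)",
        [(r["text"], json.dumps(r["terms"])) for r in reviews],
    )
    conn.commit()
    return conn


def search(conn, keyword):
    """Return (id, raw_text) for every review whose text contains the keyword."""
    rows = conn.execute(
        "SELECT id, raw_text FROM reviews WHERE raw_text LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


# Made-up sample reviews with a simple term-count representation.
sample_reviews = [
    {"text": "I love my new ACME phone", "terms": {"love": 1, "new": 1, "phone": 1}},
    {"text": "The ACME tablet battery is poor", "terms": {"tablet": 1, "battery": 1, "poor": 1}},
]

conn = build_archive(sample_reviews)
hits = search(conn, "battery")
```

Passing a file path instead of ":memory:" would keep the archive on disk for later research. The substring search shown here is only a starting point; a larger archive would likely need a proper full-text index.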
Corpus statistics such as IC can help identify the importance of a term in
the documents being analyzed. However, IC values stored in the metadata of
a traditional corpus (such as the Brown Corpus), which serves as an external
knowledge base, cannot meet the need to analyze dynamically changing, unstructured
data from the web. The problem is twofold. First, neither traditional corpora nor
their IC metadata change over time. Any term that does not exist in the corpus text,
including any newly invented word, would automatically receive a zero IC value. Second,
the corpus represents the entire knowledge base for the algorithm being used in
the downstream analysis. Because the text is unstructured, the data being analyzed
can cover any topic, and many of those topics may be absent from the
given knowledge base. For example, if the task is to research people's attitudes toward
musicians, a traditional corpus constructed 50 years ago would not know that the
term U2 is a band; U2 would therefore receive a zero IC value, which marks it as
an unimportant term. A better approach would go through all the fetched documents
and find that most of them are related to music, with U2 appearing too often
to be an unimportant term. Therefore, it is necessary to come up with a metric
that can easily adapt to the context and nature of the text instead of relying on a