are unsuitable for deep parsing. Last, we extract morphological stems and compute
frequency counts, which are then entered in the index.
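The stem-and-count step can be sketched as follows. The crude suffix stripper below is a stand-in for real morphological stemming, purely for illustration; it is not InFact's actual stemmer.

```python
from collections import Counter

def crude_stem(token: str) -> str:
    """Toy suffix stripper standing in for morphological stemming
    (illustration only -- a real stemmer is far more sophisticated)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_counts(text: str) -> Counter:
    """Lowercase, stem, and count tokens before entry into the index."""
    return Counter(crude_stem(tok) for tok in text.lower().split())

counts = index_counts("Molding molds molded the mold")
# all four inflected forms accumulate under the stem "mold"
```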
Clause Processing
The indexing service takes the output of the sentence splitter and feeds it to a
deep linguistic parser. A sentence may consist of multiple clauses. Unlike traditional
models that store only term frequency distributions, InFact performs clause-level
indexing, capturing the syntactic category and grammatical role of each term, together
with the grammatical constructs, relationships, and inter-clause links that enable it to understand events.
One strong differentiator of our approach to information extraction [4, 5, 7, 8, 14, 19]
is that we create these indices automatically, without using predefined extraction
rules, and we capture all information, not just predefined patterns. Our parser per-
forms a full constituency and dependency analysis, extracting part-of-speech (POS)
tags and grammatical roles for all tokens in every clause. In the process, tokens
undergo grammatical stemming and an optional, additional level of tagging. For
instance, when performing grammatical stemming on verb forms, we normalize to
the infinitive, but we may retain temporal tags (e.g., past, present, future), aspect
tags (e.g., progressive, perfect), mood/modality tags (e.g., possibility, subjunctive,
irrealis, negated, conditional, causal) for later use in search.
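A normalized verb entry of the kind just described might be represented as below. The lookup table and tag names are illustrative assumptions, not InFact's actual inventory; a real system would derive lemma and tags from morphological analysis rather than a table.

```python
from dataclasses import dataclass, field

@dataclass
class NormalizedVerb:
    surface: str                  # form as it appeared in the clause
    lemma: str                    # normalized to the infinitive
    tags: list = field(default_factory=list)  # temporal/aspect/modality tags

# Tiny lookup standing in for full grammatical stemming (assumption:
# entries here are hand-picked for illustration).
VERB_TABLE = {
    "molded":     ("mold",    ["past"]),
    "won":        ("win",     ["past"]),
    "appointed":  ("appoint", ["past"]),
    "is molding": ("mold",    ["present", "progressive"]),
    "might win":  ("win",     ["possibility"]),
}

def normalize_verb(surface: str) -> NormalizedVerb:
    """Normalize a verb form to its infinitive, retaining the tags
    for later use in search."""
    lemma, tags = VERB_TABLE.get(surface.lower(), (surface, []))
    return NormalizedVerb(surface, lemma, list(tags))
```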
Next, we capture inter-clause links through: 1) explicit tagging of conjunctions
or pronouns that link the syntactic structures of two adjacent
clauses in the same sentence; and 2) pointers to the list of annotated keywords in the
antecedent and following sentences. Note that the second mechanism ensures good
recall in those instances where the parser fails to produce a full parse tree for a long,
convoluted sentence, or where information about an event is spread across adjacent
sentences. In addition, appositive clauses are recognized, split into separate clauses,
and cross-referenced to the parent clause.
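A clause record carrying both linking mechanisms and the appositive cross-reference might look as follows, using the example sentence discussed below; the field names are assumptions for illustration, not InFact's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Clause:
    text: str
    link_word: Optional[str] = None    # mechanism 1: conjunction/pronoun tying
                                       # this clause to an adjacent one
    parent: Optional["Clause"] = None  # appositive -> parent cross-reference
    context_keywords: List[str] = field(default_factory=list)
                                       # mechanism 2: annotated keywords from
                                       # the antecedent/following sentence

primary = Clause("George Washington molded a fighting force")
appositive = Clause("Appointed commander of the Continental Army in 1775",
                    parent=primary)
relative = Clause("that eventually won independence from Great Britain",
                  link_word="that", parent=primary)
```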
For instance, the sentence: “Appointed commander of the Continental Army
in 1775, George Washington molded a fighting force that eventually won indepen-
dence from Great Britain” consists of three clauses, each containing a governing
verb (appoint, mold, and win). InFact decomposes it into a primary clause (“George
Washington molded a fighting force”) and two secondary clauses, which are related
to the primary clause by an appositive construct (“Appointed commander of the
Continental Army in 1775”) and a pronoun (“that eventually won independence
from Great Britain”), respectively. Each term in each clause is assigned a syntac-
tic category or POS tag (e.g., noun, adjective, etc.) and a grammatical role tag
(e.g., subject, object, etc.). InFact then utilizes these linguistic tags to extract re-
lationships that are normalized and stored in an index, as outlined in the next two
sections.
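As a sketch of how these tags support relationship extraction, a role-tagged primary clause can yield a normalized subject-verb-object triple for the index. The tag names and triple format here are illustrative assumptions, not InFact's stored representation.

```python
# Each term carries a POS tag and a grammatical role tag, as described above;
# the governing verb has been normalized to its infinitive ("mold").
tagged_clause = [
    ("George Washington", "noun", "subject"),
    ("mold",              "verb", "predicate"),
    ("fighting force",    "noun", "object"),
]

def extract_triple(clause):
    """Read off a subject-verb-object relationship from a role-tagged
    clause, ready to be normalized and stored in the index."""
    roles = {role: term for term, pos, role in clause}
    return (roles["subject"], roles["predicate"], roles["object"])

triple = extract_triple(tagged_clause)
# a (subject, predicate, object) tuple for the primary clause
```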
Linguistic Normalization
We apply normalization rules at the syntactic, semantic, or even pragmatic level.
Our approach to coreference and anaphora resolution makes use of syntactic agreement
and/or binding-theory constraints, as well as modeling of referential distance,
syntactic position, and head noun [6, 10, 12, 13, 16, 17]. Binding theory places syntactic
restrictions on the possible coreference relationships between pronouns and