their antecedents [2]. For instance, when performing pronoun coreferencing, syntactic agreement based on person, gender, and number limits our search for a noun phrase linked to a pronoun to a few candidates in the text. In addition, consistency restrictions limit our search to a precise text span (the previous sentence, the preceding text in the current sentence, or the previous and current sentence), depending on whether the pronoun is personal, possessive, or reflexive, and on its person.
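As a sketch of how such restrictions can be applied (the pronoun classes, agreement fields, and span rules below are our simplification for illustration, not the system's actual implementation), consider a candidate filter along these lines:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    person: int          # 1, 2, or 3
    gender: str          # "m", "f", or "n" (underspecified)
    number: str          # "sg" or "pl"
    sentence_index: int  # index of the containing sentence

def candidate_antecedents(pronoun, pronoun_type, mentions, current):
    """Keep only noun phrases that agree with the pronoun and fall
    inside the text span allowed for its type (simplified rules)."""
    # Span restriction by pronoun type: a reflexive resolves within
    # the current sentence; personal and possessive pronouns may also
    # look one sentence back.
    if pronoun_type == "reflexive":
        allowed = {current}
    else:
        allowed = {current - 1, current}

    return [
        m for m in mentions
        if m.sentence_index in allowed
        and m.person == pronoun.person
        and m.number == pronoun.number
        # treat "n" as underspecified gender that matches anything
        and (m.gender == pronoun.gender or "n" in (m.gender, pronoun.gender))
    ]
```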
In the sentence “John works by himself,” “himself” must refer to John, whereas in “John bought him a new car,” “him” must refer to some other individual mentioned in a previous sentence. In the sentence “‘You have not been sending money,’ John said in a recent call to his wife from Germany,” binding theory constraints restrict pronoun resolution within a quotation to first- and second-person pronouns (e.g., “you”), and restrict the candidate antecedent to a noun outside the quotation that fills the grammatical role of object of a verb or argument of a preposition (e.g., “wife”). Our coreferencing and anaphora resolution models also benefit from preferential weighting based on dependency attributes, as sketched below. Candidate antecedents that appear closer to the pronoun in the text are scored higher (weighting by referential distance). Subject is favored over object, except for accusative pronouns (weighting by syntactic position). A head noun is favored over its modifiers (weighting by head label).
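The combined effect of these preferences can be sketched as a scoring function; the numeric weights, and attribute names such as position, role, and is_head, are illustrative placeholders rather than the values and fields used by the system:

```python
def score_candidate(candidate, pronoun, pronoun_is_accusative=False):
    """Score a candidate antecedent by the three preferential
    weightings: referential distance, syntactic position, head label."""
    score = 0.0

    # Referential distance: candidates closer to the pronoun score higher.
    distance = max(pronoun.position - candidate.position, 1)
    score += 1.0 / distance

    # Syntactic position: favor subjects over objects, except when
    # resolving an accusative pronoun.
    if candidate.role == "subject" and not pronoun_is_accusative:
        score += 0.5
    elif candidate.role == "object" and pronoun_is_accusative:
        score += 0.5

    # Head label: favor a head noun over its modifiers.
    if candidate.is_head:
        score += 0.25

    return score

# The best-scoring candidate that survives the consistency filter wins:
# best = max(candidates, key=lambda c: score_candidate(c, pronoun))
```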
In addition, as part of the normalization process, we apply a transformational grammar to map multiple surface structures into an equivalent deep structure. A common example is the normalization of a dependency structure involving a passive verb form into the active voice, together with recognition of the deep subject of such a clause. At the more pragmatic level, we apply rules to normalize composite verb expressions, capture explicit and implicit negations, or verbalize nouns or adjectives in cases where they convey the action sense in preference to the governing verb of a clause. For instance, the sentences “Bill did not visit Jane,” which contains an explicit negation, and “Bill failed to visit Jane,” where the negation is rendered by a composite verb expression, are mapped to the same structure.
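This negation normalization can be sketched as follows; the flat clause representation and the list of implicitly negating verbs are simplified assumptions, not the actual transformational rules:

```python
# Composite verb expressions that convey negation of their complement.
IMPLICIT_NEGATORS = {"fail", "neglect", "refuse", "decline"}

def normalize_negation(clause):
    """Map explicit ('did not visit') and implicit ('failed to visit')
    negations to one deep structure: a negated main predicate.
    `clause` is a simplified dict, not the full augmented tree."""
    if clause.get("negated"):                        # explicit "not"
        return {"pred": clause["verb"], "neg": True,
                "subj": clause["subj"], "obj": clause.get("obj")}
    if clause["verb"] in IMPLICIT_NEGATORS and "comp_verb" in clause:
        # Verbalize the complement and carry the negation over.
        return {"pred": clause["comp_verb"], "neg": True,
                "subj": clause["subj"], "obj": clause.get("obj")}
    return {"pred": clause["verb"], "neg": False,
            "subj": clause["subj"], "obj": clause.get("obj")}

# Both surface forms map to the same deep structure:
a = normalize_negation({"verb": "visit", "negated": True,
                        "subj": "Bill", "obj": "Jane"})
b = normalize_negation({"verb": "fail", "comp_verb": "visit",
                        "subj": "Bill", "obj": "Jane"})
assert a == b == {"pred": "visit", "neg": True, "subj": "Bill", "obj": "Jane"}
```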
5.2.2 Storage
The output of a deep parser is a complex augmented tree structure that usually does
not lend itself to a tractable indexing schema for cross-document search. Therefore,
we have developed a set of rules for converting an augmented tree representation
into a scalable data storage structure.
In a dependency tree, every word in the sentence is a modifier of exactly one
other word (called its head), except the head word of the sentence, which does not
have a head. We use a list of tuples to specify a dependency tree with the following
format:
(Label Modifier Root POS Head-label Role Antecedent [Attributes])
where: Label is a unique numeric ID; Modifier is a term in the sentence; Root is the root form (or category) of the modifier; POS is its lexical category; Head-label is the ID of the term that the modifier modifies; Role specifies the type of dependency relationship between the head and the modifier, such as subject, complement, etc.; Antecedent is the antecedent of the modifier; and Attributes is the list of semantic attributes that may be associated with the modifier, e.g., person's name, location, time, number, date, etc.
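To make the format concrete, here is a hypothetical encoding of the earlier example “John works by himself” as a list of such tuples; the label numbering, POS tags, and role names are illustrative, not the system's actual inventory:

```python
from collections import namedtuple

DepTuple = namedtuple(
    "DepTuple",
    ["label", "modifier", "root", "pos",
     "head_label", "role", "antecedent", "attributes"])

# "John works by himself" -- the head word of the sentence ("works")
# has no head; we mark that here with head_label=0. The reflexive
# "himself" carries the label of its resolved antecedent ("John").
sentence = [
    DepTuple(1, "John",    "John",    "N",    2, "subject",  None, ["person-name"]),
    DepTuple(2, "works",   "work",    "V",    0, "head",     None, []),
    DepTuple(3, "by",      "by",      "Prep", 2, "modifier", None, []),
    DepTuple(4, "himself", "himself", "N",    3, "pcomp",    1,    []),
]

# A flat list of tuples like this can be bulk-loaded into a relational
# table or an inverted index, which is what makes cross-document search
# tractable; e.g., find the subject of the sentence head:
head = next(t for t in sentence if t.head_label == 0)
subject = next(t.modifier for t in sentence
               if t.role == "subject" and t.head_label == head.label)
assert subject == "John"
```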