are given different weights based on their rarity in the corpus (rare words get a larger weight), and some query words are eliminated because they are stopwords like "the" and "an", but otherwise all query words are treated equally when computing the similarity between q and d. Such a scoring scheme does not work for us, because the atype or informer tokens are fundamentally different in purpose from the selector tokens, and must be treated very differently by the scoring function. Second, vector-space scoring evolved over more than a decade, and its scoring choices are backed by probabilistic arguments (37). For scoring snippets, no such guiding principles are available.
In this section, we will first set up a parametric scoring model based on the
lexical proximity between occurrences of instances of the question atype and
occurrences of question selectors in short snippets in the corpus. We will then
set up a learning problem to estimate the parameters of the scoring function
from training data. Finally, we will describe our experiences with some TREC
question answering benchmarks.
10.3.1 A Proximity Model
Consider the query "Who invented television?", which translates to atype person#n#1 and (after stemming) selectors television and invent* (meaning invent followed by any suffix is to be matched). Figure 10.13 shows a sample snippet that contains the answer at (relative) token offset 0.
The answer token is a descendant of the node person#n#1 in WordNet.
John Baird may not be explicitly coded into the WordNet database as a
person, but a great deal of work on information extraction and named entity
tagging (35) has produced reliable automated annotators that can connect
the segment John Baird to the type node person#n#1 in WordNet.
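The descendant test above can be sketched in a few lines. The hypernym fragment and the annotator output below are hand-made toy stand-ins, not real WordNet data or a real named-entity tagger:

```python
# Toy fragment of a WordNet-like hypernym hierarchy: child -> parent.
# These entries are illustrative, not actual WordNet records.
TOY_HYPERNYMS = {
    "inventor#n#1": "person#n#1",
    "person#n#1": "organism#n#1",
    "organism#n#1": "entity#n#1",
}

# Hypothetical output of a named-entity annotator: surface segment -> type node.
ANNOTATIONS = {"John Baird": "inventor#n#1"}

def is_descendant(node, ancestor):
    """Walk hypernym links upward; True if `ancestor` is reached."""
    while node is not None:
        if node == ancestor:
            return True
        node = TOY_HYPERNYMS.get(node)
    return False

def matches_atype(segment, atype):
    """Does the annotated type of `segment` lie below `atype` in the hierarchy?"""
    node = ANNOTATIONS.get(segment)
    return node is not None and is_descendant(node, atype)

print(matches_atype("John Baird", "person#n#1"))  # True
```

In a real system the lookup would go through a full WordNet interface and an information-extraction annotator; the upward walk over hypernym links is the same.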
If the candidate (compound) token w = John Baird is assigned relative offset 0, the selector stems occur at the token offsets shown in Figure 10.13.
We will take an activation-spreading approach to scoring token position 0. Each occurrence of a selector s gets an infusion of energy, energy(s), and radiates it out along the linear token sequence in both directions. The gap between a candidate token w and a matched selector occurrence s, denoted gap(w, s), is one plus the number of intervening tokens. The selector occurrence s transfers

    energy(s) · decay(gap(w, s))

to the candidate token, where decay(g) is a suitable decreasing function of the gap g.
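The transfer rule can be sketched as follows. The exponential decay and the energy values below are illustrative choices, not the learned parameters discussed later in the section:

```python
import math

def decay(g):
    # One plausible decay function: exponential in the gap g >= 1,
    # normalized so that an adjacent selector (g = 1) transfers full energy.
    return math.exp(-(g - 1))

def score(candidate_pos, selector_positions, energy):
    """Sum energy(s) * decay(gap(w, s)) over all matched selector occurrences.

    selector_positions: {selector: [token offsets]}.
    On a linear token sequence, gap = 1 + number of intervening tokens,
    which is just the absolute difference of offsets.
    """
    total = 0.0
    for s, positions in selector_positions.items():
        for p in positions:
            gap = abs(p - candidate_pos)
            total += energy[s] * decay(gap)
    return total

# Candidate "John Baird" at offset 0, with hypothetical selector offsets
# and energies (both made up for illustration).
print(score(0, {"television": [-4], "invent": [-2]},
            {"television": 2.0, "invent": 1.5}))
```

Because energy radiates in both directions, only the magnitude of the offset difference matters; selectors before and after the candidate contribute symmetrically.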
10.3.1.1 energy and decay
Each matched selector s has an associated positive number called its energy, denoted energy(s). A common notion of energy is the inverse