are given different weights based on their rarity in the corpus (rare words get a larger weight), and some query words are eliminated because they are stopwords like "the" and "an", but otherwise all query words are treated equally when computing the similarity between q and d. Such a scoring scheme does not work for us, because the atype or informer tokens are fundamentally different in purpose from the selector tokens, and must be treated very differently by the scoring function. Second, vector-space scoring evolved over more than a decade, and its scoring choices are backed by probabilistic arguments (37). For scoring snippets, no such guiding principles are available.
In this section, we will first set up a parametric scoring model based on the
lexical proximity between occurrences of instances of the question atype and
occurrences of question selectors in short snippets in the corpus. We will then
set up a learning problem to estimate the parameters of the scoring function
from training data. Finally, we will describe our experiences with some TREC
question answering benchmarks.
10.3.1 A Proximity Model
Consider the query "Who invented television?", which translates to atype person#n#1 and (after stemming) selectors television and invent* (meaning invent followed by any suffix is to be matched). Figure 10.13 shows a sample snippet that contains the answer at (relative) token offset 0.
The answer token is a descendant of the node person#n#1 in WordNet.
John Baird may not be explicitly coded into the WordNet database as a
person, but a great deal of work on information extraction and named entity
tagging (35) has produced reliable automated annotators that can connect
the segment John Baird to the type node person#n#1 in WordNet.
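The descendant test above can be sketched in a few lines. The hypernym fragment and the annotator output below are hand-made toy stand-ins, not real WordNet data or a real named-entity tagger:

```python
# Toy fragment of a WordNet-like hypernym hierarchy: child -> parent.
# These entries are illustrative, not actual WordNet records.
TOY_HYPERNYMS = {
    "inventor#n#1": "person#n#1",
    "person#n#1": "organism#n#1",
    "organism#n#1": "entity#n#1",
}

# Hypothetical output of a named-entity annotator: surface segment -> type node.
ANNOTATIONS = {"John Baird": "inventor#n#1"}

def is_descendant(node, ancestor):
    """Walk hypernym links upward; True if `ancestor` is reached."""
    while node is not None:
        if node == ancestor:
            return True
        node = TOY_HYPERNYMS.get(node)
    return False

def matches_atype(segment, atype):
    """Does the annotated type of `segment` lie below `atype` in the hierarchy?"""
    node = ANNOTATIONS.get(segment)
    return node is not None and is_descendant(node, atype)

print(matches_atype("John Baird", "person#n#1"))  # True
```

In a real system the lookup would go through a full WordNet interface and an information-extraction annotator; the upward walk over hypernym links is the same.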
If the candidate (compound) token w = John Baird is assigned relative offset 0, the selector stems occur at the token offsets shown in Figure 10.13.
We will take an activation-spreading approach to scoring token position 0. Each occurrence of a selector s gets an infusion of energy, energy(s), and radiates it out along the linear token sequence in both directions. The gap between a candidate token w and a matched selector occurrence s, denoted gap(w, s), is one plus the number of intervening tokens. The selector occurrence s transfers

    energy(s) · decay(gap(w, s))

to the candidate token, where decay(g) is a suitable decreasing function of the gap g.
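The transfer rule can be sketched as follows. The exponential decay and the energy values below are illustrative choices, not the learned parameters discussed later in the section:

```python
import math

def decay(g):
    # One plausible decay function: exponential in the gap g >= 1,
    # normalized so that an adjacent selector (g = 1) transfers full energy.
    return math.exp(-(g - 1))

def score(candidate_pos, selector_positions, energy):
    """Sum energy(s) * decay(gap(w, s)) over all matched selector occurrences.

    selector_positions: {selector: [token offsets]}.
    On a linear token sequence, gap = 1 + number of intervening tokens,
    which is just the absolute difference of offsets.
    """
    total = 0.0
    for s, positions in selector_positions.items():
        for p in positions:
            gap = abs(p - candidate_pos)
            total += energy[s] * decay(gap)
    return total

# Candidate "John Baird" at offset 0, with hypothetical selector offsets
# and energies (both made up for illustration).
print(score(0, {"television": [-4], "invent": [-2]},
            {"television": 2.0, "invent": 1.5}))
```

Because energy radiates in both directions, only the magnitude of the offset difference matters; selectors before and after the candidate contribute symmetrically.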
10.3.1.1 energy and decay
Each matched selector s has an associated positive number called its energy, denoted energy(s). A common notion of energy is the inverse