10.3.3 Experiments
A few thousand questions are available from the TREC 2000 Question
Answering Track, annotated with atypes (24). We identified 261 questions for
which the answer tokens prescribed by TREC included at least one instance
or subtype of the atype of the question. Some other questions had types like
reason (“Why is the sky blue?”) and recipe (“How to bake banana bread?”)
that we cannot handle, or did not have any usable positive answer instances
because WordNet does not have a known is-a connection between the atype
and the answer token, e.g., WordNet does not know about the vast majority
of politicians or quantum physicists living today. For each question, we need
positive (answer) and negative (candidate but not answer) tokens, and, to
learn their distinction well, we should collect negative tokens that are “closest”
to the positive ones, i.e., strongly activated by selectors.
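The labeling scheme above can be sketched as follows. The taxonomy, synset names, and helper functions here are invented for illustration, not taken from the authors' system: a candidate is positive iff the question's atype lies on its is-a chain, and among the negatives we keep only the most strongly activated ones.

```python
# Toy is-a taxonomy: a tiny, invented stand-in for WordNet hypernym links.
HYPERNYMS = {
    "Einstein#n#1": ["physicist#n#1"],
    "physicist#n#1": ["scientist#n#1"],
    "scientist#n#1": ["person#n#1"],
    "person#n#1": ["entity#n#1"],
    "Chicago#n#1": ["city#n#1"],
    "city#n#1": ["entity#n#1"],
}

def is_a_ancestors(synset):
    """Return `synset` together with every ancestor reachable via is-a links."""
    seen, stack = set(), [synset]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(HYPERNYMS.get(s, []))
    return seen

def label(candidate, atype):
    """Positive iff the question's atype lies on the candidate's is-a chain."""
    return atype in is_a_ancestors(candidate)

def training_tokens(candidates, atype, k=300):
    """candidates: list of (activation_score, synset) pairs. Keep every
    positive token but only the k most strongly activated negatives,
    i.e., the negatives "closest" to the positives."""
    pos = [c for c in candidates if label(c[1], atype)]
    neg = sorted((c for c in candidates if not label(c[1], atype)),
                 reverse=True)[:k]
    return pos, neg
```

For the atype person#n#1, for example, Einstein#n#1 would be labeled positive and Chicago#n#1 negative, however strongly the latter is activated.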
10.3.3.1 Data collection and preparation
Full atype index: We first indexed the corpus. Apart from a regular
Lucene (2) inverted index on stems, we prepared a full atype index on the
corpus, as follows. Each document is a sequence of tokens. Tokens can
be compound, such as New_York. An annotator module (see Figure 10.2)
connects some tokens to nodes in the atype taxonomy, e.g., the string token
Einstein might be connected to both senses Einstein#n#1 (the specific
Physicist) and Einstein#n#2 (genius). (Disambiguation can be integrated
into the annotator module, but is an extensive research area in NLP (29) and
is outside our scope.)
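A minimal sketch of such an annotator, assuming a hand-built surface-form-to-sense map (the real module would consult WordNet; the map below is invented for illustration):

```python
# Invented token-to-sense map; a real annotator would look senses up in WordNet.
SENSE_MAP = {
    "Einstein": ["Einstein#n#1", "Einstein#n#2"],  # the physicist, and "genius"
    "New_York": ["New_York#n#1"],
}

def annotate(tokens):
    """Yield (token_offset, synset) pairs. An ambiguous token such as
    Einstein yields one pair per candidate sense, since we do not
    attempt disambiguation."""
    for offset, tok in enumerate(tokens):
        for synset in SENSE_MAP.get(tok, []):
            yield offset, synset
```

Note that tokens absent from the map (e.g., function words) simply produce no annotations.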
We overrode Lucene's token scanner to look up WordNet once a token was
connected to one or more synsets, and walk up is-a (hypernym) links in the
WordNet type hierarchy. All synsets encountered as ancestors are regarded
as having occurred at the same token offset in the document as the original
token. In our running example, given that the original token is Einstein, we would
regard physicist#n#1, intellectual#n#1, scientist#n#1, person#n#1,
organism#n#1, living_thing#n#1, object#n#1, causal_agent#n#1, and
entity#n#1 as having occurred at the same token offset, and index all of these
as a separate field in Lucene. (This consumes a large amount of temporary
space, but we drastically reduce the space requirement in a second pass, see
Section 10.4.)
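The hypernym walk can be sketched as follows, using a toy taxonomy and plain dictionary posting lists in place of Lucene (in Lucene itself, posting a term at the same offset as the original token roughly corresponds to emitting it with a position increment of zero):

```python
from collections import defaultdict

# Toy is-a links standing in for WordNet's hypernym hierarchy.
HYPERNYMS = {
    "Einstein#n#1": ["physicist#n#1"],
    "physicist#n#1": ["scientist#n#1", "intellectual#n#1"],
    "scientist#n#1": ["person#n#1"],
    "intellectual#n#1": ["person#n#1"],
    "person#n#1": ["entity#n#1"],
}

def build_full_atype_index(annotated_doc):
    """annotated_doc: iterable of (offset, synset) pairs from the annotator.
    Each synset and every is-a ancestor is posted at the same token offset
    as the original token."""
    index = defaultdict(list)  # synset -> list of token offsets
    for offset, synset in annotated_doc:
        seen, stack = set(), [synset]
        while stack:
            s = stack.pop()
            if s in seen:        # a synset may be reachable via several paths
                continue
            seen.add(s)
            index[s].append(offset)
            stack.extend(HYPERNYMS.get(s, []))
    return index
```

After indexing the annotation (0, "Einstein#n#1"), a query for any ancestor atype, such as person#n#1 or entity#n#1, finds offset 0, which is exactly what makes candidate lookup by atype a single index probe.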
Collecting labeled data for RankExp: We used the full atype index to
locate all candidate tokens, and made a generous estimate of the activation
from (the nearest occurrence of) each selector. This generous estimate used
the log IDF as energy and no decay, i.e., energy was accrued unattenuated at
the candidate position. For each query, we retained all positive answer tokens
and the 300 negative tokens with top scores. Overall, we finished with 169,662
positive and negative contexts. 5-fold cross-validation (i.e., 80% training, 20%
 