10.3.3 Experiments
A few thousand questions are available from the TREC 2000 Question
Answering Track, annotated with atypes (24). We identified 261 questions for
which the answer tokens prescribed by TREC included at least one instance
or subtype of the atype of the question. Some other questions had types like
reason ("Why is the sky blue?") and recipe ("How to bake banana bread?")
that we cannot handle, or did not have any usable positive answer instances
because WordNet does not have a known is-a connection between the atype
and the answer token; e.g., WordNet does not know about the vast majority
of politicians or quantum physicists living today. For each question, we need
positive (answer) and negative (candidate but not answer) tokens, and, to
learn their distinction well, we should collect negative tokens that are "closest"
to the positive ones, i.e., strongly activated by selectors.
10.3.3.1 Data collection and preparation
Full atype index: We first indexed the corpus. Apart from a regular
Lucene (2) inverted index on stems, we prepared a full atype index on the
corpus, as follows. Each document is a sequence of tokens. An annotator
connects some tokens to nodes in the atype taxonomy; e.g., the string token
Einstein might be connected to both senses Einstein#n#1 (the specific
physicist) and Einstein#n#2 (genius). (Disambiguation can be integrated
into the annotator module, but it is an extensive research area in NLP (29) and
is outside our scope.)
We overrode Lucene's token scanner to look up WordNet once a token was
connected to one or more synsets, and walk up is-a (hypernym) links in the
WordNet type hierarchy. All synsets encountered as ancestors are regarded
as having occurred at the same token offset in the document as the original
token. In our running example, given the original token is Einstein, we would
regard physicist#n#1, intellectual#n#1, scientist#n#1, person#n#1,
organism#n#1, living_thing#n#1, object#n#1, causal_agent#n#1, and
entity#n#1 as having occurred at the same token offset, and index all of these
as a separate field in Lucene. (This consumes a large amount of temporary
space, but we drastically reduce the space requirement in a second pass; see
Section 10.4.)
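The hypernym-expansion step above can be sketched as follows. This is an illustrative stand-in, not the actual Lucene token scanner: the tiny HYPERNYMS map below replaces WordNet, and the function and variable names are our own inventions.

```python
# Hand-made stand-in for WordNet's is-a (hypernym) links; WordNet itself
# would supply these edges in the real indexer.
HYPERNYMS = {
    "einstein#n#1": ["physicist#n#1"],
    "physicist#n#1": ["scientist#n#1"],
    "scientist#n#1": ["person#n#1"],
    "person#n#1": ["organism#n#1", "causal_agent#n#1"],
    "organism#n#1": ["living_thing#n#1"],
    "living_thing#n#1": ["object#n#1"],
    "object#n#1": ["entity#n#1"],
    "causal_agent#n#1": ["entity#n#1"],
    "entity#n#1": [],
}

def expand_ancestors(synset):
    """Return the synset plus all its is-a ancestors, without duplicates."""
    seen, stack = [], [synset]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.append(s)
            stack.extend(HYPERNYMS.get(s, []))
    return seen

def atype_postings(tokens, token_to_synsets):
    """Map each token offset to the set of synsets indexed at that offset:
    the token's own synsets plus every hypernym ancestor."""
    postings = {}
    for offset, tok in enumerate(tokens):
        for syn in token_to_synsets.get(tok.lower(), []):
            for anc in expand_ancestors(syn):
                postings.setdefault(offset, set()).add(anc)
    return postings
```

For example, `atype_postings(["Einstein", "lived", "here"], {"einstein": ["einstein#n#1"]})` records physicist#n#1 through entity#n#1 all at offset 0, mirroring how every ancestor is treated as occurring at the original token's position.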
Collecting labeled data for RankExp: We used the full atype index to
locate all candidate tokens, and made a generous estimate of the activation
from (the nearest occurrence of) each selector. This generous estimate used
the log IDF as energy and no decay, i.e., energy was accrued unattenuated at
the candidate position. For each query, we retained all positive answer tokens
and the 300 negative tokens with top scores. Overall, we finished with 169,662
positive and negative contexts. 5-fold cross-validation (i.e., 80% training, 20%