10.3.3 Experiments
A few thousand questions are available from the TREC 2000 Question
Answering Track, annotated with atypes (24). We identified 261 questions for
which the answer tokens prescribed by TREC included at least one instance
or subtype of the atype of the question. Some other questions had types like
reason ("Why is the sky blue?") and recipe ("How to bake banana bread?")
that we cannot handle, or did not have any usable positive answer instances
because WordNet does not have a known is-a connection between the atype
and the answer token; e.g., WordNet does not know about the vast majority
of politicians or quantum physicists living today. For each question, we need
positive (answer) and negative (candidate but not answer) tokens, and, to
learn their distinction well, we should collect negative tokens that are "closest"
to the positive ones, i.e., strongly activated by selectors.
10.3.3.1 Data collection and preparation
Full atype index: We first indexed the corpus. Apart from a regular
Lucene (2) inverted index on stems, we prepared a full atype index on the
corpus, as follows. Each document is a sequence of tokens. An annotator
connects some tokens to nodes in the atype taxonomy; e.g., the string token
Einstein might be connected to both senses Einstein#n#1 (the specific
physicist) and Einstein#n#2 (genius). (Disambiguation can be integrated
into the annotator module, but it is an extensive research area in NLP (29) and
is outside our scope.)
We overrode Lucene's token scanner to look up WordNet once a token was
connected to one or more synsets, and walk up is-a (hypernym) links in the
WordNet type hierarchy. All synsets encountered as ancestors are regarded
as having occurred at the same token offset in the document as the original
token. In our running example, given the original token is Einstein, we would
regard physicist#n#1, intellectual#n#1, scientist#n#1, person#n#1,
organism#n#1, living_thing#n#1, object#n#1, causal_agent#n#1, and
entity#n#1 as having occurred at the same token offset, and index all of these
as a separate field in Lucene. (This consumes a large amount of temporary
space, but we drastically reduce the space requirement in a second pass; see
Section 10.4.)
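The hypernym-expansion step above can be sketched as follows. This is an illustrative stand-in, not the actual Lucene token scanner: the tiny HYPERNYMS map below replaces WordNet, and the function and variable names are our own inventions.

```python
# Hand-made stand-in for WordNet's is-a (hypernym) links; WordNet itself
# would supply these edges in the real indexer.
HYPERNYMS = {
    "einstein#n#1": ["physicist#n#1"],
    "physicist#n#1": ["scientist#n#1"],
    "scientist#n#1": ["person#n#1"],
    "person#n#1": ["organism#n#1", "causal_agent#n#1"],
    "organism#n#1": ["living_thing#n#1"],
    "living_thing#n#1": ["object#n#1"],
    "object#n#1": ["entity#n#1"],
    "causal_agent#n#1": ["entity#n#1"],
    "entity#n#1": [],
}

def expand_ancestors(synset):
    """Return the synset plus all its is-a ancestors, without duplicates."""
    seen, stack = [], [synset]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.append(s)
            stack.extend(HYPERNYMS.get(s, []))
    return seen

def atype_postings(tokens, token_to_synsets):
    """Map each token offset to the set of synsets indexed at that offset:
    the token's own synsets plus every hypernym ancestor."""
    postings = {}
    for offset, tok in enumerate(tokens):
        for syn in token_to_synsets.get(tok.lower(), []):
            for anc in expand_ancestors(syn):
                postings.setdefault(offset, set()).add(anc)
    return postings
```

For example, `atype_postings(["Einstein", "lived", "here"], {"einstein": ["einstein#n#1"]})` records physicist#n#1 through entity#n#1 all at offset 0, mirroring how every ancestor is treated as occurring at the original token's position.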
Collecting labeled data for RankExp: We used the full atype index to
locate all candidate tokens, and made a generous estimate of the activation
from (the nearest occurrence of) each selector. This generous estimate used
the log IDF as energy and no decay, i.e., energy was accrued unattenuated at
the candidate position. For each query, we retained all positive answer tokens
and the 300 negative tokens with top scores. Overall, we finished with 169,662
positive and negative contexts. 5-fold cross-validation (i.e., 80% training, 20%