9.4.1.1 Automating evaluations based on answer keys
Automatic evaluation methods would allow for faster system building and
tuning, as well as provide an objective and affordable way of comparing various
systems. Recently, such methods have been proposed, based more or less on
the idea of n-gram co-occurrences. Pourpre (13) assigns a fractional recall
score to a system response based on its unigram overlap with a given nugget's
description. For example, a system response 'A B C' has recall 3/4 with
respect to a nugget with description 'A B C D'. However, such an approach
is unfair to systems that present the same information using words other
than A, B, C, and D. Another open issue is how to weight individual words
when measuring the closeness of a match. For example, consider the question
“How many prisoners escaped?” In the nugget 'Seven prisoners escaped from
a Texas prison,' there is no indication that 'seven' is the keyword and that
it must be matched to receive any relevance credit. Using IDF values does not
help, since 'seven' will generally not have a higher IDF than words like 'Texas'
and 'prison', an observation of ours supported by the results reported by
the authors of Pourpre. Redefining the nugget as just 'seven' does not solve
the problem either, since it might then spuriously match any mention of 'seven'
out of context. Nuggeteer (16) works on similar principles but makes binary
decisions about whether a nugget is present in a given system response by
tuning a threshold. However, it is also plagued by 'spurious relevance,' since
not all words of the nugget description (or of known correct responses) are
central to the nugget.
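To make the unigram-overlap idea concrete, the following is a minimal sketch of a Pourpre-style fractional recall computation. The function name and the plain whitespace tokenization are illustrative assumptions; refinements discussed above, such as IDF-based term weighting, are not shown.

    def unigram_recall(response, nugget):
        """Fraction of the nugget description's word tokens that also
        appear in the system response (case-insensitive)."""
        response_words = {w.lower() for w in response.split()}
        nugget_words = [w.lower() for w in nugget.split()]
        if not nugget_words:
            return 0.0
        hits = sum(1 for w in nugget_words if w in response_words)
        return hits / len(nugget_words)

    # The running example: response 'A B C' vs. nugget 'A B C D'
    print(unigram_recall("A B C", "A B C D"))   # prints 0.75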
9.4.1.2 Nugget-matching rules
We propose a reliable automatic method for determining whether a snippet
of text contains a given nugget, based on nugget-matching rules, which are
generated using a semi-automatic procedure explained below. These rules are
essentially boolean queries that match only those snippets that contain the
nugget. For instance, a candidate rule for matching answers to “How many
prisoners escaped?” is (Texas AND seven AND escape AND (convicts OR
prisoners)), possibly with other synonyms and variants in the rule. For
a corpus of news articles, which usually follow a formal prose style, it is
surprisingly easy to write such simple rules to match expected answers, if
assisted by an appropriate tool.
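As a concrete illustration of how such a boolean rule might be evaluated against a candidate snippet, the sketch below represents rules as nested tuples and matches terms by exact (case-insensitive) word equality. Both the representation and the helper names are assumptions made for illustration, not the actual rule language of our tool; as noted above, real rules would also carry synonyms and morphological variants.

    import re

    def snippet_terms(snippet):
        """Lower-cased set of word tokens in a snippet."""
        return {w.lower() for w in re.findall(r"\w+", snippet)}

    def matches(rule, terms):
        """Evaluate a boolean rule against a snippet's term set.
        A rule is either a single word or a tuple ('AND'|'OR', [subrules])."""
        if isinstance(rule, str):
            return rule.lower() in terms
        op, subrules = rule
        if op == "AND":
            return all(matches(r, terms) for r in subrules)
        if op == "OR":
            return any(matches(r, terms) for r in subrules)
        raise ValueError("unknown operator: " + op)

    # The candidate rule from the text for “How many prisoners escaped?”
    rule = ("AND", ["Texas", "seven", "escape",
                    ("OR", ["convicts", "prisoners"])])

    print(matches(rule, snippet_terms(
        "Seven prisoners escape from a Texas prison")))   # True
    print(matches(rule, snippet_terms(
        "Seven people were arrested in Texas")))          # False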
We propose a two-stage approach, inspired by Autoslog (17), that combines
the strength of humans in identifying semantically equivalent expressions and
the strength of the system in gathering statistical evidence from a human-
annotated corpus of documents. In the first stage, human subjects annotated
(using a highlighting tool) portions of on-topic documents that contained
answers to each nugget.¹
In the second stage, subjects used our rule generation

¹ LDC (21) already provides relevance judgments for 100 topics on the TDT4 corpus. We