9.4.1.1 Automating evaluations based on answer keys
Automatic evaluation methods would allow for faster system building and
tuning, as well as provide an objective and affordable way of comparing various
systems. Recently, such methods have been proposed, based more or less on
the idea of n-gram co-occurrences. Pourpre (13) assigns a fractional recall
score to a system response based on its unigram overlap with a given nugget's
description. For example, a system response 'A B C' has recall 3/4 with
respect to a nugget with description 'A B C D'. However, such an approach
is unfair to systems that present the same information using words other
than A, B, C, and D. Another open issue is how to weight individual words
when measuring the closeness of a match. For example, consider the question
“How many prisoners escaped?” In the nugget 'Seven prisoners escaped from
a Texas prison,' there is no indication that 'seven' is the keyword and that
it must be matched to receive any relevance credit. Using IDF values does not
help, since 'seven' will generally not have a higher IDF than words like 'Texas'
and 'prison', an observation of ours supported by the results reported by
the authors of Pourpre. Redefining the nugget as just 'seven' does not solve
the problem either, since it might then spuriously match any mention of 'seven'
out of context. Nuggeteer (16) works on similar principles but makes binary
decisions about whether a nugget is present in a given system response by
tuning a threshold. However, it is also plagued by 'spurious relevance,' since
not all words of the nugget description (or of known correct responses) are
central to the nugget.
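To make the unigram-overlap idea concrete, the following is a minimal sketch of a Pourpre-style fractional recall computation. The function name and the plain whitespace tokenization are illustrative assumptions; refinements discussed above, such as IDF-based term weighting, are not shown.

    def unigram_recall(response, nugget):
        """Fraction of the nugget description's word tokens that also
        appear in the system response (case-insensitive)."""
        response_words = {w.lower() for w in response.split()}
        nugget_words = [w.lower() for w in nugget.split()]
        if not nugget_words:
            return 0.0
        hits = sum(1 for w in nugget_words if w in response_words)
        return hits / len(nugget_words)

    # The running example: response 'A B C' vs. nugget 'A B C D'
    print(unigram_recall("A B C", "A B C D"))   # prints 0.75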
9.4.1.2 Nugget-matching rules
We propose a reliable automatic method for determining whether a snippet
of text contains a given nugget, based on nugget-matching rules, which are
generated using a semi-automatic procedure explained below. These rules are
essentially boolean queries that match only those snippets that contain the
nugget. For instance, a candidate rule for matching answers to “How many
prisoners escaped?” is (Texas AND seven AND escape AND (convicts OR
prisoners)), possibly with other synonyms and variants in the rule. For
a corpus of news articles, which usually follow a formal prose style, it is
surprisingly easy to write such simple rules to match expected answers, if
assisted by an appropriate tool.
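As a concrete illustration of how such a boolean rule might be evaluated against a candidate snippet, the sketch below represents rules as nested tuples and matches terms by exact (case-insensitive) word equality. Both the representation and the helper names are assumptions made for illustration, not the actual rule language of our tool; as noted above, real rules would also carry synonyms and morphological variants.

    import re

    def snippet_terms(snippet):
        """Lower-cased set of word tokens in a snippet."""
        return {w.lower() for w in re.findall(r"\w+", snippet)}

    def matches(rule, terms):
        """Evaluate a boolean rule against a snippet's term set.
        A rule is either a single word or a tuple ('AND'|'OR', [subrules])."""
        if isinstance(rule, str):
            return rule.lower() in terms
        op, subrules = rule
        if op == "AND":
            return all(matches(r, terms) for r in subrules)
        if op == "OR":
            return any(matches(r, terms) for r in subrules)
        raise ValueError("unknown operator: " + op)

    # The candidate rule from the text for “How many prisoners escaped?”
    rule = ("AND", ["Texas", "seven", "escape",
                    ("OR", ["convicts", "prisoners"])])

    print(matches(rule, snippet_terms(
        "Seven prisoners escape from a Texas prison")))   # True
    print(matches(rule, snippet_terms(
        "Seven people were arrested in Texas")))          # False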
We propose a two-stage approach, inspired by Autoslog (17), that combines
the strength of humans in identifying semantically equivalent expressions and
the strength of the system in gathering statistical evidence from a human-
annotated corpus of documents. In the first stage, human subjects annotated
(using a highlighting tool) portions of on-topic documents that contained
answers to each nugget.¹
In the second stage, subjects used our rule generation

¹ LDC (21) already provides relevance judgments for 100 topics on the TDT4 corpus. We