for the 50 fine-grained classes was surprisingly small. The authors explain
this in the following terms: “the syntactic tree does not normally contain the
information required to distinguish between the various fine categories within
a coarse category.”
10.2.1 Answer Type Clues in Questions
We contend that the above methods for generating features from the
question overload the learner with too many features, most of them far from
the critical question tokens that reveal the richest clues to the atype.
In fact, our experiments show that a very short subsequence of question
tokens (typically 1-3 words) is an adequate clue for question classification,
at least by humans. We call these segments informer spans. This is certainly
true of the most trivial atypes (Who wrote Hamlet? or How many dogs pull a
sled at Iditarod?) but is also true of more subtle clues (How much does a rhino
weigh?). Informal experiments revealed the surprising property that only one
segment is enough. In the above question, a human does not even need the
how much clue (which hints at only a generic quantity) once the word weigh is
available. In fact, “How much does a rhino cost?” has an identical syntax but
an atype that is a completely different subtype of “quantity,” not revealed by
how much alone. The only exceptions to the single-span hypothesis are multi-
function questions like “What is the name and age of ...,” which should be
assigned to multiple answer types. In this paper we consider questions where
one type suffices.
Consider another question with multiple clues: Who is the CEO of IBM?
In isolation, the clue who merely tells us that the answer might be a person or
country or perhaps an organization, while CEO is perfectly precise, rendering
who unnecessary. All of the above applies a fortiori to what and which
clues, which are essentially uninformative on their own, as in “What is the
distance between Pisa and Rome?”
The informer span is very sensitive to the structure of clauses, phrases
and possessives in the question, as is clear from these examples (informers
italicized): “What is Bill Clinton's wife's profession,” and “What country's
president was shot at Ford's Theater.” Depending on sentence structure, the
informer can be near to or far from question triggers like what, which and
how.
The choice of informer spans also depends on the target classification
system. Initially we wished to handle definition questions separately, and
marked no informer tokens in “What is digitalis?” However, “what is” is an
excellent informer for the UIUC question class DESC:def (“definition”).
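To make the informer-span idea concrete, the following is a minimal sketch of a classifier that predicts a question's atype from features drawn only from a hand-labeled informer span, together with the leading wh-word. It is an illustration under assumptions, not the system described in this chapter: the feature templates (informer unigrams, the whole span, and the wh-word), the toy training examples, and the use of scikit-learn's LinearSVC are ours; the actual feature generation is described in Section 10.2.3.

# Minimal sketch (illustrative assumptions, not the chapter's system):
# predict a question's answer type from features drawn only from a
# hand-labeled informer span plus the leading wh-word, using a linear SVM.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each entry: (question tokens, informer span as [start, end) token offsets,
# UIUC-style class label).  Spans and labels here are hypothetical examples.
train = [
    ("who wrote Hamlet".split(),                            (0, 1), "HUM:ind"),
    ("how much does a rhino weigh".split(),                 (5, 6), "NUM:weight"),
    ("how much does a rhino cost".split(),                  (5, 6), "NUM:money"),
    ("what is the distance between Pisa and Rome".split(),  (3, 4), "NUM:dist"),
]

def informer_features(tokens, span):
    # Only the wh-word and the tokens inside the informer span contribute;
    # everything else in the question is ignored.
    lo, hi = span
    feats = {"wh=" + tokens[0].lower(): 1.0}
    for tok in tokens[lo:hi]:
        feats["informer=" + tok.lower()] = 1.0
    feats["informer_span=" + "_".join(t.lower() for t in tokens[lo:hi])] = 1.0
    return feats

X = [informer_features(toks, span) for toks, span, _ in train]
y = [label for _, _, label in train]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

# "Who is the CEO of IBM?" with informer span "CEO" (token offset 3).
print(model.predict([informer_features("who is the CEO of IBM".split(), (3, 4))]))

The point of the sketch is only that the learner never sees tokens outside the informer span and the wh-word; how such features are actually generated and combined in this chapter is deferred to Section 10.2.3.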
Before we get into the job of annotating the question with the informer
segment, we summarize the accuracy obtained by some of the approaches
reviewed earlier, as well as by a linear SVM that was provided with suitable
features generated from the informer segment (details in Section 10.2.3). If
“perfect” informer spans are labeled by hand, and features generated only