Algorithm                   6-class   50-class
Li and Roth                 (1)       78.8 (2)
Hacioglu et al., SVM+ECOC   -         80.2-82
Zhang & Lee, LinearSVM      87.4      79.2
Zhang & Lee, TreeSVM        90        -
SVM, "perfect" informer     94.2      88
SVM, CRF-informer           93.4      86.2

FIGURE 10.3: Summary of % accuracy for UIUC data. (1) SNoW accuracy
without the related-word dictionary was not reported; with the
related-word dictionary, it achieved 91%. (2) SNoW with a related-word
dictionary achieved 84.2%, but the other algorithms did not use it. Our
results are summarized in the last two rows; see text for details.
Using features from these spans, a simple linear SVM beats all earlier
approaches. This confirms our suspicion that the earlier approaches
suffered because they generated spurious features from low-signal
portions of the question.
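To illustrate the idea (this is a minimal sketch with hand-made toy
examples, assuming scikit-learn; it is not the chapter's actual
experimental setup), one can draw features only from the marked
informer span, so the low-signal remainder of the question contributes
no features at all:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy (question, informer span, answer type) triples; hand-made
# examples for illustration, not the UIUC data.
data = [
    ("What is the capital of Japan?", "capital", "LOCATION"),
    ("How tall is the Eiffel Tower?", "tall",    "NUMERIC"),
    ("What country borders Spain?",   "country", "LOCATION"),
    ("Who painted the Mona Lisa?",    "Who",     "HUMAN"),
]
spans  = [span for _, span, _ in data]
labels = [atype for _, _, atype in data]

# Features come from the informer span alone, so the rest of the
# question can never generate spurious features.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(spans, labels)
print(clf.predict(["country"]))  # expected: ['LOCATION'] on this toy data
```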
10.2.2 Sequential Labeling of Type Clue Spans
In a real system, the atype informer span needs to be marked automatically
in the question. This turns out to be a more difficult problem. Syntactic
pattern-matching and heuristics widely used in QA systems are not very good
at capturing informer spans, as we shall see in Section 10.2.4.
We will model the generation of the question token sequence as a Markov
chain. An automaton makes probabilistic transitions between hidden states
y, one of which is an "informer-generating state," and emits tokens x. We
observe the tokens and have to guess which were produced from the
informer-generating state. Recent work has shown that conditional random fields
(CRFs) (26; 35) have a consistent advantage over traditional HMMs in the
face of many redundant features. We refer the reader to the above references
for a detailed treatment of CRFs.
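To make the decoding task concrete, here is a minimal pure-Python sketch
of Viterbi decoding for such a hidden-state model. It is a toy HMM
decoder with hand-set parameters, not the trained CRF used in the
chapter; forbidden starts and transitions are encoded as log-probability
-inf, which is how the state-machine constraints described next are
enforced.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for `tokens`.

    All probabilities are in log space; forbidden starts/transitions
    are float('-inf'), so they can never lie on the best path.
    """
    # V[t][s]: log-probability of the best path ending in state s at t.
    V = [{s: start_p[s] + emit_p(s, tokens[0]) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + trans_p[p][s])
            V[t][s] = V[t - 1][prev] + trans_p[prev][s] + emit_p(s, tokens[t])
            back[t][s] = prev
    # Trace back from the best final state.
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```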
Two HMM topologies are commonly used for text annotation and information
extraction. The first is the "in/out" model with two states. One ("in")
state generates tokens that should be annotated as the informer span; the
other ("out") state generates the remaining tokens. All transitions between
the two states are allowed, so multiple "in" (informer) spans are possible
in the output, which goes against our intuition above. The second HMM is
the 3-state "begin/in/out" (BIO) model, also widely used in information
extraction. In the 3-state model the initial state cannot be "2," and all
states can be final; these transitions allow at most one informer span.
The two state machines are shown in Figure 10.4.
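Continuing the sketch above (and reusing its viterbi function), the two
topologies can be written down as transition tables. All probability
values and the emission model are hand-set guesses, and reading the
three states as "1 = before the span, 2 = informer, 3 = after" is our
assumption about the numbering used above:

```python
import math

NEG = float("-inf")  # log(0): a forbidden start or transition

# "in/out" model: every transition is allowed, so the decoder may enter
# and leave "in" repeatedly, producing multiple informer spans.
io_states = ["in", "out"]
io_start  = {"in": math.log(0.1), "out": math.log(0.9)}
io_trans  = {
    "in":  {"in": math.log(0.5), "out": math.log(0.5)},
    "out": {"in": math.log(0.2), "out": math.log(0.8)},
}

# 3-state model: assumed numbering "1" = before, "2" = informer,
# "3" = after. The chain cannot start in "2", and "2" cannot be
# re-entered once left, so at most one informer span is possible.
bio_states = ["1", "2", "3"]
bio_start  = {"1": math.log(0.9), "2": NEG, "3": math.log(0.1)}
bio_trans  = {
    "1": {"1": math.log(0.6), "2": math.log(0.4), "3": NEG},
    "2": {"1": NEG, "2": math.log(0.5), "3": math.log(0.5)},
    "3": {"1": NEG, "2": NEG, "3": 0.0},  # log(1.0)
}

# Hand-set emission model: informer-like nouns are likelier from the
# informer state ("in" / "2") than from the other states.
def emit_p(state, token):
    informer = token.lower() in {"city", "author", "year", "country"}
    if state in ("in", "2"):
        return math.log(0.7) if informer else math.log(0.3)
    return math.log(0.05) if informer else math.log(0.95)

q = "What city hosted the first Olympics".split()
print(viterbi(q, io_states, io_start, io_trans, emit_p))
print(viterbi(q, bio_states, bio_start, bio_trans, emit_p))
# Both decodings tag "city" as the informer here, but only the 3-state
# topology guarantees the informer positions form one contiguous block.
```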
The BIO model is better than the in/out model for much the same
reasons as in information extraction, but we give some specific examples for
 