Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications


Adjacent cell features IsPrevTag and IsNextTag: A CRF can exploit context by coupling the state at position $i$ with observations at positions adjacent to $i$ (extending to larger windows did not help). To capture this, we use more boolean features: position 4 fires the feature $\mathrm{IsPrevTag}_{1,\mathrm{DT},1}$ because $x[3,1].\mathrm{tag} = \mathrm{DT}$ and $y_4 = 1$. Position 4 also fires $\mathrm{IsPrevTag}_{1,\mathrm{NP},2}$ because $x[3,2].\mathrm{tag} = \mathrm{NP}$ and $y_4 = 1$. Similarly, we define an $\mathrm{IsNextTag}_{y,t,\ell}$ feature for each possible $(y,t,\ell)$ triple.
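As a rough illustration, the sketch below enumerates the adjacent-cell features that fire at one position. It reads the three subscripts as (state, tag, level), which is how the example above appears to use them. The function name, the list-of-lists layout for the tag table, and every tag other than DT and NP at the third position are assumptions made for this example, not part of the original system. Positions are 0-based in the code, so the text's position 4 is index 3.

```python
from typing import List, Tuple

# (feature name, state y_i, neighbouring tag, level); levels are 1-based as in the text.
Feature = Tuple[str, int, str, int]


def adjacent_tag_features(tags: List[List[str]], y: List[int], i: int) -> List[Feature]:
    """Return the IsPrevTag / IsNextTag features fired at 0-based position i.

    tags[p][l] is the tag of the cell at position p and level l + 1
    (a hypothetical layout chosen for this sketch).
    """
    feats: List[Feature] = []
    if i > 0:  # couple y_i with every level of the previous position
        for level, tag in enumerate(tags[i - 1], start=1):
            feats.append(("IsPrevTag", y[i], tag, level))
    if i + 1 < len(tags):  # couple y_i with every level of the next position
        for level, tag in enumerate(tags[i + 1], start=1):
            feats.append(("IsNextTag", y[i], tag, level))
    return feats


# Hypothetical two-level tag table; only the DT/NP cells at the third
# position come from the example in the text.
tags = [["WP", "WHNP"], ["VBZ", "VP"], ["DT", "NP"], ["NN", "NP"]]
y = [0, 0, 0, 1]
fired = adjacent_tag_features(tags, y, 3)  # the text's "position 4"
```

With this input, `fired` holds the two IsPrevTag features named in the text; no IsNextTag feature fires because the position is the last one.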

State transition features IsEdge: Position $i$ fires feature $\mathrm{IsEdge}_{u,v}$ if $y_{i-1} = u$ and $y_i = v$. There is one such feature for each state pair $(u, v)$ allowed by the transition graph. In addition, we have sentinel features $\mathrm{IsBegin}_u$ and $\mathrm{IsEnd}_u$ marking the beginning and end of the token sequence.
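A minimal sketch of how these transition and sentinel features could be enumerated for a labeled sequence follows. The transition graph in `ALLOWED` is a hypothetical three-state example invented for illustration; this passage does not spell out the actual graph, and the function name is likewise an assumption.

```python
from typing import List, Set, Tuple

# Hypothetical transition graph: the set of allowed (u, v) state pairs.
ALLOWED: Set[Tuple[int, int]] = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)}


def transition_features(y: List[int], allowed: Set[Tuple[int, int]]) -> List[tuple]:
    """List IsBegin, IsEdge, and IsEnd features fired by state sequence y."""
    if not y:
        return []
    feats: List[tuple] = [("IsBegin", y[0])]  # sentinel for the first token
    for i in range(1, len(y)):
        u, v = y[i - 1], y[i]
        if (u, v) not in allowed:
            # No IsEdge feature exists for pairs outside the transition graph.
            raise ValueError(f"transition {u}->{v} not allowed at position {i}")
        feats.append(("IsEdge", u, v))
    feats.append(("IsEnd", y[-1]))  # sentinel for the last token
    return feats


edge_feats = transition_features([0, 0, 1, 2], ALLOWED)
```

Raising an error on a disallowed pair is a design choice of this sketch; a decoder would normally never propose such a sequence in the first place.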

Handling compound words: At first we collapsed compounds like New_York_City (if found in WordNet) into a single token. Initial experiments showed that compound detection is generally useful, but hurts accuracy when it is wrong. (This is true of almost all front-end token processors.) We then enhanced our code to detect compounds and raise a compound alert feature, without collapsing the tokens. Specifically, for every position $i$ and state pair $(y_1, y_2)$, we fired a special feature (i.e., set its value to 1) if the compound detector claimed that $x_{i-1}$ and $x_i$ were parts of the same compound. This gave the CRF a robust bias toward labeling a compound with a common state, without making this a hard policy, and boosted our accuracy slightly.
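The compound alert can be sketched as follows, assuming the detector's output is available as one compound id per token (`None` for tokens outside any compound). That input format, the feature name `CompoundAlert`, and the example question are hypothetical; the text only says that a feature indexed by the state pair fires when two adjacent tokens belong to the same compound.

```python
from typing import Dict, List, Optional, Tuple


def compound_alert_features(
    y: List[int], compound_id: List[Optional[int]]
) -> Dict[int, Tuple[str, int, int]]:
    """Map each position i to the alert feature fired there, if any.

    The feature is indexed by the state pair (y[i-1], y[i]) and fires only
    when tokens i-1 and i share a compound id. Tokens are left uncollapsed.
    """
    fired: Dict[int, Tuple[str, int, int]] = {}
    for i in range(1, len(y)):
        same = compound_id[i] is not None and compound_id[i] == compound_id[i - 1]
        if same:
            fired[i] = ("CompoundAlert", y[i - 1], y[i])
    return fired


# Hypothetical question: "Who is mayor of New York City"
tokens = ["Who", "is", "mayor", "of", "New", "York", "City"]
compound_id = [None, None, None, None, 0, 0, 0]  # last three tokens form one compound
y = [0, 0, 1, 2, 2, 2, 2]
alerts = compound_alert_features(y, compound_id)
```

Because the feature is a soft signal rather than a merge step, a learned weight can favor equal states across a compound while still letting the model override a wrong detection.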

10.2.2.3 Heuristic informer annotation

Even if one concedes that informers provide valuable features, one may question whether the elaborate mechanism using parse trees and CRFs is necessary. In the literature, much simpler heuristics are often used to directly extract the atype from a question. Singhal et al. (36) pick the head of the first noun phrase detected by a shallow parser. Ramakrishnan et al. (32) use the head of the noun phrase adjoining the main verb. The LASSO (31), FALCON (17) and Webclopedia (18) systems use dozens to hundreds of (unpublished, to our knowledge) hand-built pattern-matching rules on the output of a full-scale parser.

We would like to play off our CRF-based informer annotator against such a heuristic annotator. We know of no readily available public code that implements the latter class, so we implemented the following rules:

How: For questions starting with how, we use the bigram starting with how unless the next word is a verb.

Wh: If the wh-word is not how, what or which, use the wh-word in the question as a separate feature.
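The two rules above can be sketched as a small function over a part-of-speech-tagged question. Several details are assumptions made for this example: the function name, the `WH_WORDS` list, the use of Penn Treebank tags with any `VB*` tag counting as a verb, and the choice to return `None` whenever neither rule applies (the rules above do not say what happens in those cases).

```python
from typing import List, Optional, Tuple

# Assumed list of wh-words; the text does not enumerate them.
WH_WORDS = {"who", "whom", "whose", "when", "where", "why", "what", "which", "how"}
EXCLUDED = {"how", "what", "which"}


def heuristic_informer(tagged: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Apply the How and Wh rules to a list of (word, POS tag) pairs."""
    if not tagged:
        return None
    words = [w.lower() for w, _ in tagged]
    # How rule: bigram starting with "how", unless the next word is a verb.
    if words[0] == "how":
        if len(tagged) > 1 and not tagged[1][1].startswith("VB"):
            return ["how", words[1]]
        return None  # outcome not specified by the rules above
    # Wh rule: use the first wh-word itself unless it is how, what, or which.
    for w in words:
        if w in WH_WORDS:
            return [w] if w not in EXCLUDED else None
    return None


r1 = heuristic_informer([("How", "WRB"), ("many", "JJ"), ("people", "NNS")])
r2 = heuristic_informer([("How", "WRB"), ("do", "VBP"), ("birds", "NNS"), ("fly", "VB")])
r3 = heuristic_informer([("Who", "WP"), ("wrote", "VBD"), ("Hamlet", "NNP")])
r4 = heuristic_informer([("What", "WP"), ("is", "VBZ"), ("gravity", "NN")])
```

Here `r1` is the bigram `["how", "many"]`, `r3` is `["who"]`, and `r2` and `r4` are `None` because neither rule produces a feature for them. Whether modal verbs (tagged `MD` in the Penn Treebank scheme) should also block the How rule is left open by this sketch.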
