Adjacent cell features IsPrevTag and IsNextTag: A CRF can exploit context
by coupling the state at position i with observations at positions adjacent
to i (extending to larger windows did not help). To capture this, we use more
boolean features: position 4 fires the feature IsPrevTag_{1,DT,1} because
x[3,1].tag = DT and y_4 = 1. Position 4 also fires IsPrevTag_{1,NP,2} because
x[3,2].tag = NP and y_4 = 1. Similarly, we define an IsNextTag_{y,t,ℓ} feature
for each possible (y, t, ℓ) triple.
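The firing rule for these features can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the nested-list layout of the tag table, the function name adjacent_tag_features, and the example tags are all assumptions, and IsNextTag is assumed to mirror IsPrevTag by reading the cell at position i + 1.

```python
from typing import List, Set, Tuple

# A fired feature is identified by (name, state y, tag t, level l).
Feature = Tuple[str, int, str, int]


def adjacent_tag_features(tags: List[List[str]], i: int, y: int) -> Set[Feature]:
    """Return the adjacent-cell features fired at 1-based position i in state y.

    Assumed layout (hypothetical): tags[p - 1][l - 1] is x[p, l].tag,
    i.e. the tag at position p and level l.
    """
    fired: Set[Feature] = set()
    if i > 1:  # a previous position exists
        for level, tag in enumerate(tags[i - 2], start=1):
            fired.add(("IsPrevTag", y, tag, level))
    if i < len(tags):  # a next position exists (assumed symmetric rule)
        for level, tag in enumerate(tags[i], start=1):
            fired.add(("IsNextTag", y, tag, level))
    return fired


# Illustrative tag table: one row per position, one column per level.
example_tags = [
    ["WP", "WHNP"],
    ["VBZ", "VP"],
    ["DT", "NP"],
    ["NN", "NP"],
    ["NN", "NP"],
]
features_at_4 = adjacent_tag_features(example_tags, i=4, y=1)
```

With this table, position 4 in state 1 fires IsPrevTag for (1, DT, 1) and (1, NP, 2), matching the example in the text.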
State transition features IsEdge: Position i fires feature IsEdge_{u,v} if
y_{i-1} = u and y_i = v. There is one such feature for each state pair (u, v)
allowed by the transition graph. In addition, we have sentinel features
IsBegin_u and IsEnd_u marking the beginning and end of the token sequence.
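As a rough illustration, the transition and sentinel features for a labeled sequence could be enumerated as follows. The edge set ALLOWED_EDGES is a hypothetical three-state graph chosen only for the example (the actual transition graph is defined elsewhere), and the function name is likewise an assumption.

```python
from typing import List, Set, Tuple

# Hypothetical transition graph for illustration only.
ALLOWED_EDGES: Set[Tuple[int, int]] = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)}


def transition_features(states: List[int],
                        allowed: Set[Tuple[int, int]] = ALLOWED_EDGES) -> List[tuple]:
    """List the IsBegin, IsEdge and IsEnd features fired by a state sequence."""
    if not states:
        return []
    fired: List[tuple] = [("IsBegin", states[0])]
    for u, v in zip(states, states[1:]):
        if (u, v) not in allowed:
            # No feature exists for a pair outside the transition graph.
            raise ValueError(f"transition {u} -> {v} is not in the graph")
        fired.append(("IsEdge", u, v))
    fired.append(("IsEnd", states[-1]))
    return fired


example_fired = transition_features([0, 0, 1, 1, 2])
```

Under these assumptions, a sequence of n states fires exactly one IsBegin feature, n - 1 IsEdge features, and one IsEnd feature.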
Handling compound words: At first we collapsed compounds like
New_York_City (if found in WordNet) into a single token. Initial experiments
showed that compound detection is generally useful, but hurts accuracy when
it is wrong. (This is true of almost all front-end token processors.) We then
changed our code to emit a compound alert feature instead of collapsing the
tokens. For every position i and state pair (y_1, y_2), we fired a special
feature (i.e., set its value to 1) if the compound detector claimed that
x_{i-1} and x_i were parts of the same compound. This gave the CRF a robust
bias toward labeling a compound with a common state, without making this a
hard policy, and boosted our accuracy slightly.
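The compound alert idea can be sketched as below. This is a minimal illustration under stated assumptions: a small in-memory set of lowercased word tuples stands in for the WordNet lookup, positions are 0-based, and the names compound_links, compound_alert_features, and CompoundAlert are hypothetical.

```python
from typing import List, Set, Tuple


def compound_links(tokens: List[str],
                   compounds: Set[Tuple[str, ...]],
                   max_len: int = 4) -> List[bool]:
    """linked[i] is True when tokens[i - 1] and tokens[i] share a detected compound."""
    lowered = [t.lower() for t in tokens]
    linked = [False] * len(tokens)
    for start in range(len(tokens)):
        # Try the longest candidate span first; spans have at least two tokens.
        for end in range(min(len(tokens), start + max_len), start + 1, -1):
            if tuple(lowered[start:end]) in compounds:
                for i in range(start + 1, end):
                    linked[i] = True
                break
    return linked


def compound_alert_features(linked: List[bool], states: List[int]) -> List[tuple]:
    """Fire one alert feature per linked position, keyed by the state pair."""
    return [("CompoundAlert", states[i - 1], states[i])
            for i in range(1, len(states)) if linked[i]]


toy_lexicon = {("new", "york"), ("new", "york", "city")}  # stand-in for WordNet
tokens = ["What", "is", "the", "population", "of", "New", "York", "City"]
links = compound_links(tokens, toy_lexicon)
alerts = compound_alert_features(links, [0, 0, 0, 0, 0, 1, 1, 1])
```

Because the tokens are never merged, a wrong detection only adds a feature whose learned weight the CRF can discount, rather than forcing one label onto the whole span.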
10.2.2.3 Heuristic informer annotation
Even if one concedes that informers provide valuable features, one may
question whether the elaborate mechanism using parse trees and CRFs is
necessary. In the literature, much simpler heuristics are often used to directly
extract the atype from a question. Singhal et al. (36) pick the head of the first
noun phrase detected by a shallow parser. Ramakrishnan et al. (32) use the
head of the noun phrase adjoining the main verb. The LASSO (31), FALCON
(17) and Webclopedia (18) systems use dozens to hundreds of (unpublished to
our knowledge) hand-built pattern-matching rules on the output of a full-scale
parser.
We would like to play off our CRF-based informer annotator against such
a heuristic annotator. We know of no readily available public code that
implements the latter class, so we implemented the following rules:
How: For questions starting with how, we use the bigram starting with how
unless the next word is a verb.
Wh: If the wh-word is not how, what or which, use the wh-word in the
question as a separate feature.
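The How and Wh rules might be coded roughly as follows. This sketch makes several assumptions that the text does not specify: the question arrives as (word, POS tag) pairs, verbs are recognized by Penn Treebank tags beginning with VB, the wh-word list WH_WORDS is illustrative, and the function returns None whenever neither rule applies.

```python
from typing import List, Optional, Tuple

# Illustrative wh-word list (an assumption, not taken from the text).
WH_WORDS = {"who", "whom", "whose", "when", "where", "why", "what", "which", "how"}


def how_wh_informer(tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Apply the How and Wh rules to a POS-tagged question.

    Returns the heuristic informer string, or None if neither rule yields one.
    """
    words = [word.lower() for word, _ in tagged]
    if not words:
        return None
    # How rule: use the bigram starting with "how" unless a verb follows.
    if words[0] == "how":
        if len(tagged) > 1 and not tagged[1][1].startswith("VB"):
            return f"how {words[1]}"
        return None
    # Wh rule: any wh-word other than how, what or which is used directly.
    for word in words:
        if word in WH_WORDS:
            return word if word not in {"how", "what", "which"} else None
    return None


how_many = how_wh_informer([("How", "WRB"), ("many", "JJ"), ("people", "NNS")])
who = how_wh_informer([("Who", "WP"), ("wrote", "VBD"), ("Hamlet", "NNP")])
```

Here the first call yields the bigram "how many" and the second yields "who"; a question such as "How do birds fly" returns None because a verb follows how.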