as a table, then generate CRF features from the table. The table for the parse
tree in Figure 10.5 is shown in Figure 10.6.
10.2.2.2 Cells and attributes
A labeled question comprises the token sequence x_i, i = 1, ..., and the label
sequence y_i, i = 1, .... Each x_i leads to a column vector of observations.
Therefore we use matrix notation to write down x: a table cell is addressed
as x[i, ℓ], where i is the token position (column index) and ℓ is the level or
row index, 1-6 in this example. (Although the parse tree can be arbitrarily
deep, we found that using features from up to level ℓ = 2 was adequate.)
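For concreteness, a minimal sketch of this addressing scheme in Python; the
Cell class and the accessor are illustrative names, not part of the original
system, and the two per-cell attributes are defined next:

    from dataclasses import dataclass

    @dataclass
    class Cell:
        tag: str = ""   # syntactic class assigned by the parser (defined below)
        num: int = 0    # positional chunk count (defined below)

    # The observation matrix x: cell (i, l) holds token position i (column)
    # at parse level l (row), both 1-indexed as in the text.
    table: dict[tuple[int, int], Cell] = {}

    def x(i: int, l: int) -> Cell:
        """Address a table cell as x[i, l]."""
        return table.setdefault((i, l), Cell())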
Intuitively, much of the information required for spotting an informer
can be obtained from the part of speech of the tokens and phrase/clause
attachment information. In contrast, specific word information is generally
sparse and potentially misleading; the same word may or may not be an
informer depending on its position, e.g., “What birds eat snakes?” and “What
snakes eat birds?” have the same words but different informers. Accordingly,
we observe two properties at each cell:
tag : The syntactic class assigned to the cell by the parser, e.g., x[4, 2].tag =
NP. It is well known that POS and chunk information are major clues to
informer-tagging; specifically, informers are often nouns or noun phrases.
num : Many heuristics exploit the fact that the first NP has a higher chance
of containing informers than subsequent NPs. To capture this positional
information, we define the num of a cell at [i, ℓ] as one plus the number of
distinct contiguous chunks to the left of [i, ℓ] with tags equal to x[i, ℓ].tag.
E.g., at level 2 in the table above, “the capital city” forms the first NP, while
“Japan” forms the second NP. Therefore x[7, 2].num = 2.
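As a sketch of this computation, assume each level is encoded as a list of
per-token tags in which maximal runs of identical tags form one chunk, and
that the chunk containing position i itself is not counted; the function name
and the example tags are illustrative, not from the original system:

    def num_at(tags, i):
        """num of cell [i, l]: one plus the number of distinct contiguous
        chunks strictly to the left of position i (1-indexed) whose tag
        equals the tag at position i."""
        target = tags[i - 1]
        j = i - 1                       # 0-based index of position i
        while j > 0 and tags[j - 1] == target:
            j -= 1                      # skip the chunk containing i itself
        chunks, prev = 0, None
        for t in tags[:j]:
            if t == target and prev != target:
                chunks += 1             # a new same-tag chunk starts here
            prev = t
        return 1 + chunks

    # Illustrative level-2 tags for “What is the capital city of Japan?”
    level2 = ["WHNP", "VP", "NP", "NP", "NP", "PP", "NP"]
    assert num_at(level2, 7) == 2       # “Japan” is the second NP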
In conditional models, it is notationally convenient to express features as
functions on (x_i, y_i). To one unfamiliar with CRFs, it may seem strange that
y_i is passed as an argument to features. At training time, y_i is indeed known,
and at testing time, the CRF algorithm efficiently finds the most probable
sequence of y_i's using a Viterbi search. True labels are not revealed to the
CRF at testing time.
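Concretely, such a feature is just a boolean function that receives the
candidate label together with the observation; a schematic sketch, not any
particular CRF library's interface, anticipating the IsTag features defined
next:

    def istag_1_np_2(y_i, tag_at_level_2):
        """Fires iff the candidate label is 1 and the level-2 cell at
        this position is tagged NP."""
        return y_i == 1 and tag_at_level_2 == "NP"

    # Training: evaluated with the known label, e.g. istag_1_np_2(1, "NP").
    # Decoding: Viterbi evaluates it for every candidate value of y_i and
    # keeps the label sequence with the highest total score.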
Cell features IsTag and IsNum : E.g., the observation “y_4 = 1 and
x[4, 2].tag = NP” is captured by the statement that “position 4 fires the
feature IsTag_{1,NP,2}” (which has a boolean value). There is an IsTag_{y,t,ℓ}
feature for each (y, t, ℓ) triplet, where y is the state, t is the POS, and ℓ is
the level. Similarly, for every possible state y, every possible num value n
(up to some maximum horizon), and every level ℓ, we define boolean features
IsNum_{y,n,ℓ}. E.g., position 7 fires the feature IsNum_{2,2,2} in the 3-state
model, capturing the statement “x[7, 2].num = 2 and y_7 = 2”.
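A sketch of how these feature families could be enumerated, assuming a
3-state model, levels 1-2, a small illustrative tag inventory, and a num
horizon of 3; all of these bounds are assumptions, and cell_at(i, l) stands
for the x[i, l] lookup sketched earlier:

    from itertools import product

    STATES = [1, 2, 3]                  # labels of the 3-state model
    LEVELS = [1, 2]                     # the text uses levels up to 2
    TAGS = ["NP", "VP", "PP", "WHNP"]   # illustrative tag inventory
    MAX_NUM = 3                         # assumed maximum horizon for num

    def build_features(cell_at):
        """cell_at(i, l) returns the cell x[i, l]. Each feature maps a
        (candidate label y_i, token position i) pair to True/False."""
        feats = {}
        for y, t, l in product(STATES, TAGS, LEVELS):
            feats[f"IsTag_{y},{t},{l}"] = \
                lambda y_i, i, y=y, t=t, l=l: y_i == y and cell_at(i, l).tag == t
        for y, n, l in product(STATES, range(1, MAX_NUM + 1), LEVELS):
            feats[f"IsNum_{y},{n},{l}"] = \
                lambda y_i, i, y=y, n=n, l=l: y_i == y and cell_at(i, l).num == n
        return feats

With these definitions, “position 7 fires IsNum_{2,2,2}” amounts to
feats["IsNum_2,2,2"](2, 7) returning True.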