in tagging, the SVM is given a chance to correlate these mistakes with the true
label. In contrast, in the first approach, the SVM may see test data that
is distributionally different from the training data, because the training
data is of higher quality: its informer spans are human-generated. For these
reasons, we implemented the second option. We have anecdotal evidence that
the accuracy of the second approach is somewhat higher, because we subject
the SVM to the limitations of the CRF output uniformly during both training
and testing.
The SVM used is a linear multi-class one-vs-one SVM,2 as in the Zhang
and Lee (40) baseline. We do not use ECOC (16) because the reported gain is
less than 1%. Through tuning, we found that the SVM “C” parameter (used
to trade off training-data fit against model complexity) must be set to 300
to achieve published baseline numbers.
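The setup above can be sketched concretely. The snippet below is an illustrative reconstruction, not the authors' code: it uses scikit-learn's `SVC`, which wraps the LIBSVM library cited in the footnote, with a linear kernel, one-vs-one multi-class decomposition, and the tuned value C = 300. The toy questions and labels are invented for illustration.

```python
# Sketch of the classifier configuration described in the text, using
# scikit-learn's SVC (a wrapper around LIBSVM).  The questions, labels,
# and feature extractor here are toy stand-ins; C=300 is the tuned
# value reported in the text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy training questions with coarse UIUC-style labels (illustrative only).
questions = [
    "What is the height of Mount Everest ?",
    "How far is the Moon from Earth ?",
    "Who wrote Hamlet ?",
    "Who is the CEO of IBM ?",
]
labels = ["NUMBER:distance", "NUMBER:distance",
          "HUMAN:individual", "HUMAN:individual"]

# Word 1-gram features, as in the SVM baseline.
vec = CountVectorizer()
X = vec.fit_transform(questions)

# Linear multi-class SVM; LIBSVM decomposes the multi-class problem
# into one-vs-one binary subproblems internally.
clf = SVC(kernel="linear", C=300, decision_function_shape="ovo")
clf.fit(X, labels)

print(clf.predict(vec.transform(["How long is the Nile ?"])))
```

With C this large, the SVM strongly penalizes training errors, which matches the text's observation that a high C was needed to reproduce the published baseline.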
10.2.3.1 Informer q-gram features
Our main modification to earlier SVM-based approaches is in generating
features from informers. In earlier work, word features were generated from
word q-grams. We can apply the same method to the informer span: e.g.,
for the question “What is the height of Mount Everest?”, where height is the
informer span, we generate a feature corresponding to height. (We also
generate regular word features; therefore we tag the features so that
‘height’ occurring inside the informer span generates a distinct feature from
‘height’ occurring outside the informer span.)
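The tagging scheme just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(start, end)` span representation and the `word:`/`informer:` prefixes are assumptions made for clarity, and only unigram (q = 1) features are shown; higher-order q-grams over the span are handled analogously.

```python
# Minimal sketch of tagged feature generation.  Tokens inside the
# informer span get an "informer:" prefix, so 'height' inside the
# span and 'height' outside it become distinct features.
def question_features(tokens, informer_span):
    """tokens: list of words; informer_span: (start, end) token
    offsets, end exclusive, or None if no informer was found."""
    start, end = informer_span if informer_span else (0, 0)
    feats = []
    for i, tok in enumerate(tokens):
        tok = tok.lower()
        feats.append("word:" + tok)          # regular word feature
        if start <= i < end:
            feats.append("informer:" + tok)  # tagged informer feature
    return feats

toks = "What is the height of Mount Everest ?".split()
# Token 3 ('height') is the informer span, so it yields both
# "word:height" and "informer:height".
print(question_features(toks, (3, 4)))
```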
As in regular text classification, the goal is to reveal to the learner
important correlations between informer features and question classes, e.g.,
the UIUC label system has a class called NUMBER:distance . We would expect
informers like length or height to be strongly correlated with the class label
NUMBER:distance .
10.2.3.2 Informer hypernym features
Another set of features generated from informer tokens proves to be
valuable. The class label NUMBER:distance is correlated with a number of
potential informer q-grams, such as height, how far, how long, how many
miles, etc. In an ideal setting, given very large amounts of labeled data, all
such correlations can be learnt automatically. In real life, training data is
limited. As a second example, the UIUC label system has a single coarse-grained
class called HUMAN:individual, whereas questions may use diverse
atype informer tokens like author, cricketer, or CEO.
There are prebuilt databases such as WordNet (30) where explicit
hypernym-hyponym (x is a kind of y) relations are cataloged as a directed
acyclic graph of types. For example, author, cricketer, CEO would all connect
2 http://www.csie.ntu.edu.tw/~cjlin/libsvm/