NLP (such as part-of-speech tagging and sentence parsing) is increasingly
being achieved through machine learning. Li and Roth (27), Hacioglu and
Ward (16) and Zhang and Lee (40) have used supervised learning for question
classification.
The use of machine learning has enabled the above systems to handle larger
datasets and more complex type systems. A benchmark available from UIUC 1
is now standard. It has 6 coarse and 50 fine answer types in a two-level
taxonomy, together with 5500 training and 500 test questions. Webclopedia
(18) has also published its taxonomy with over 140 types.
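As a concrete illustration, a loader for benchmark data of this kind might look as follows. The line format assumed here (a "COARSE:fine" label, a space, then the question text) and the specific labels are illustrative assumptions, not taken from the text:

```python
# Sketch of parsing one line of a question-classification dataset.
# Assumed line format: "COARSE:fine question text ..." (an assumption
# about the file layout, not confirmed by the surrounding text).
def parse_line(line):
    label, question = line.split(" ", 1)
    coarse, fine = label.split(":", 1)
    return coarse, fine, question.strip()

# Hypothetical labeled example using the chapter's running question.
coarse, fine, q = parse_line("LOC:mount What is the tallest mountain in Africa ?")
print(coarse, fine)  # LOC mount
```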
Compared to other areas of text mining, question classification has benefited
from machine learning somewhat less than one might expect.
Li and Roth (27) used question features like tokens, parts of speech (POS),
chunks (non-overlapping phrases) and named entity (NE) tags. Some of
these features, such as part-of-speech, may themselves be generated from
sophisticated inference methods. Li and Roth achieved 78.8% accuracy for
50 classes. When a hand-built dictionary of “semantically related words”
(unpublished, to our knowledge) was added, accuracy improved to 84.2%. It seems
desirable to use only off-the-shelf knowledge bases and labeled training data
consisting of questions and their atypes. Designing and maintaining the
dictionary may be comparable in effort to maintaining a rule base.
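The kind of sparse indicator-feature map described above can be sketched as follows. The tag values shown are assumptions for illustration; in practice the POS, chunk, and named-entity tags would come from upstream taggers:

```python
# Sketch of a sparse binary feature map over tokens, POS tags, chunk
# labels, and named-entity tags, in the spirit of Li and Roth's features.
# All tags are assumed to be precomputed by upstream components.
def question_features(tokens, pos_tags, chunks, ne_tags):
    feats = set()
    feats.update(f"tok={t.lower()}" for t in tokens)
    feats.update(f"pos={p}" for p in pos_tags)
    feats.update(f"chunk={c}" for c in chunks)
    feats.update(f"ne={n}" for n in ne_tags if n != "O")  # skip non-entities
    return sorted(feats)

# Hypothetical tags for the running example question.
feats = question_features(
    ["What", "is", "the", "tallest", "mountain", "in", "Africa", "?"],
    ["WP", "VBZ", "DT", "JJS", "NN", "IN", "NNP", "."],
    ["NP", "VP", "NP", "NP", "NP", "PP", "NP", "O"],
    ["O", "O", "O", "O", "O", "O", "LOCATION", "O"],
)
```

Each feature is an indicator (present or absent), so the resulting vectors are very high-dimensional but extremely sparse, which linear classifiers handle well.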
Support Vector Machines (SVMs) (38) have been widely successful in many
other learning tasks. SVMs were applied to question classification shortly
after the work of Li and Roth. Hacioglu and Ward (16) used linear support
vector machines with a very simple set of features: question word 2-grams.
E.g., the question “What is the tallest mountain in Africa?” leads to the
features “what is”, “is the”, “the tallest”, etc., which can be collected in
a bag of 2-grams. (It may help to mark the beginning 2-gram in some special way.) They
did not use any named-entity tags or related word dictionary. Early SVM
formulations and implementations usually handled two classes. Hacioglu and
Ward used a technique by Dietterich and Bakiri (12) to adapt two-class SVMs
to the multiclass setting in question classification. The high-level idea is to
represent class labels with carefully chosen numbers, represent the numbers in
the binary system and have one SVM predict each bit position. This is called
the “error-correcting output code” (ECOC) approach. The overall accuracy
was 80.2-82%, slightly higher than Li and Roth's baseline.
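Both ingredients can be sketched in a few lines. The begin marker `<s>`, the toy four-bit codebook, and the stubbed-out per-bit classifiers are assumptions for illustration; the actual codewords in ECOC are chosen to maximize Hamming distance between classes:

```python
# Bag of word 2-grams with a special begin marker, as described above.
def bigram_features(question):
    tokens = ["<s>"] + question.lower().rstrip("?").split()
    return set(zip(tokens, tokens[1:]))

# ECOC: each class gets a binary codeword; one binary classifier is trained
# per bit position (stubbed here). At prediction time the bit string the
# classifiers emit is decoded to the nearest codeword by Hamming distance,
# which is what makes the code error-correcting.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(predicted_bits, codebook):
    return min(codebook, key=lambda c: hamming(codebook[c], predicted_bits))

codebook = {            # toy codewords for 3 classes (illustrative only)
    "LOC": (0, 0, 1, 1),
    "NUM": (0, 1, 0, 1),
    "HUM": (1, 1, 1, 0),
}
bag = bigram_features("What is the tallest mountain in Africa?")
# Suppose the four bit-classifiers output (1, 1, 1, 1): this is one bit
# away from HUM's codeword and two bits from the others, so the single
# bit error is corrected.
print(ecoc_decode((1, 1, 1, 1), codebook))  # HUM
```

With codewords spaced at Hamming distance d, up to floor((d-1)/2) individual classifier errors can be corrected, which is why ECOC can outperform naive one-bit-per-class encodings.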
Zhang and Lee (40) used linear SVMs with all possible question word q-grams,
i.e., the above question now leads to the features “what”, “what is”,
“what is the”, ..., “is”, “is the”, “is the tallest”, etc. They obtained an accuracy of
79.2% without using ECOC, slightly higher than the Li and Roth baseline
but somewhat lower than Hacioglu and Ward. Zhang and Lee went on to
design an ingenious kernel on question parse trees, which yielded visible gains
for the 6 coarse labels in the UIUC classification system. The accuracy gain
1 http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/