of features should be sufficient to train a good classifier. Suppose we split the feature set into two sets and train two classifiers, f1 and f2, where each classifier is trained on a different set. Then, f1 and f2 are used to predict the class labels for the unlabeled data, Xu. Each classifier then teaches the other: the tuple having the most confident prediction from f1 is added (along with its label) to the set of labeled data for f2. Similarly, the tuple having the most confident prediction from f2 is added to the set of labeled data for f1. The method is summarized in Figure 9.17. Cotraining is less sensitive to errors than self-training. A difficulty is that its assumptions may not hold in practice; that is, it may not be possible to split the features into mutually exclusive, class-conditionally independent sets.
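The teaching loop described above can be sketched in code. This is only a toy illustration, not the text's algorithm: the data are hypothetical 1-D "views," and each of f1 and f2 is a 1-nearest-neighbor classifier whose confidence is the (negated) distance to its nearest labeled point.

```python
# Toy cotraining sketch (hypothetical data; 1-NN classifiers stand in for f1, f2).

def nn_predict(labeled, x):
    """1-NN prediction; confidence is the negated distance to the nearest labeled point."""
    nearest_x, nearest_y = min(labeled, key=lambda lv: abs(lv[0] - x))
    return nearest_y, -abs(nearest_x - x)

def cotrain(view1, view2, labels, unlabeled1, unlabeled2, rounds=1):
    L1 = list(zip(view1, labels))   # labeled set for f1 (its view of each tuple)
    L2 = list(zip(view2, labels))   # labeled set for f2
    U = list(range(len(unlabeled1)))  # indices of still-unlabeled tuples
    for _ in range(rounds):
        if not U:
            break
        # f1's most confident prediction teaches f2 (tuple moves with its label).
        i = max(U, key=lambda j: nn_predict(L1, unlabeled1[j])[1])
        y, _ = nn_predict(L1, unlabeled1[i])
        L2.append((unlabeled2[i], y))
        U.remove(i)
        if not U:
            break
        # Similarly, f2's most confident prediction teaches f1.
        i = max(U, key=lambda j: nn_predict(L2, unlabeled2[j])[1])
        y, _ = nn_predict(L2, unlabeled2[i])
        L1.append((unlabeled1[i], y))
        U.remove(i)
    return L1, L2
```

With two well-separated labeled points per view, each classifier confidently labels one unlabeled tuple and passes it, with its label, to the other.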
Alternative approaches to semi-supervised learning exist. For example, we can model
the joint probability distribution of the features and the labels. For the unlabeled data,
the labels can then be treated as missing data. The EM algorithm (Chapter 11) can be
used to maximize the likelihood of the model. Methods using support vector machines
have also been proposed.
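The EM idea can be sketched as follows. This is a toy illustration under stated assumptions, not the text's algorithm: hypothetical 1-D data, two classes modeled as unit-variance Gaussians, and the labels of the unlabeled tuples treated as missing values that the E-step estimates softly.

```python
# Toy semi-supervised EM sketch (hypothetical 1-D data; two classes modeled
# as equal-variance Gaussians with the variance fixed at 1).
import math

def gauss(x, mu):
    # Unnormalized unit-variance Gaussian likelihood.
    return math.exp(-0.5 * (x - mu) ** 2)

def semi_supervised_em(labeled, unlabeled, iters=20):
    # Initialize each class mean from the labeled data only.
    mu = {}
    for c in {y for _, y in labeled}:
        xs = [x for x, y in labeled if y == c]
        mu[c] = sum(xs) / len(xs)
    classes = sorted(mu)
    for _ in range(iters):
        # E-step: soft "missing" labels (class responsibilities) for unlabeled data.
        resp = []
        for x in unlabeled:
            w = [gauss(x, mu[c]) for c in classes]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate means from labeled plus soft-labeled data.
        for k, c in enumerate(classes):
            num = sum(x for x, y in labeled if y == c) \
                + sum(r[k] * x for r, x in zip(resp, unlabeled))
            den = sum(1 for _, y in labeled if y == c) \
                + sum(r[k] for r in resp)
            mu[c] = num / den
    return mu
```

Each iteration increases the likelihood of the model; the unlabeled points simply pull the class means toward themselves in proportion to their responsibilities.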
9.7.3 Active Learning
Active learning is an iterative type of supervised learning that is suitable for situations
where data are abundant, yet the class labels are scarce or expensive to obtain. The learn-
ing algorithm is active in that it can purposefully query a user (e.g., a human oracle) for
labels. The number of tuples used to learn a concept this way is often much smaller than
the number required in typical supervised learning.
How does active learning work to overcome the labeling bottleneck? To keep costs down, the active learner aims to achieve high accuracy using as few labeled instances as possible. Let D be all of the data under consideration. Various strategies exist for active
learning on D . Figure 9.18 illustrates a pool-based approach to active learning. Suppose
that a small subset of D is class-labeled. This set is denoted L . U is the set of unlabeled
data in D . It is also referred to as a pool of unlabeled data. An active learner begins with
L as the initial training set. It then uses a querying function to carefully select one or
more data samples from U and requests labels for them from an oracle (e.g., a human
annotator). The newly labeled samples are added to L , which the learner then uses in
a standard supervised way. The process repeats, with the learner aiming for high accuracy from as few labeled tuples as possible. Active learning algorithms are typically evaluated using learning curves, which plot accuracy as a function of the number of instances queried.
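The pool-based loop just described can be sketched in code. This is a toy illustration under stated assumptions, not the book's Figure 9.18: hypothetical 1-D data, a 1-NN learner, an oracle implemented as a labeling function, and a querying function that treats the pool point farthest from the labeled set as the least certain.

```python
# Toy pool-based active learning sketch (hypothetical 1-D data; 1-NN learner).

def nearest(L, x):
    """Return the (point, label) pair in L closest to x."""
    return min(L, key=lambda lv: abs(lv[0] - x))

def active_learn(L, U, oracle, budget):
    L = list(L)   # initial labeled training set
    U = list(U)   # pool of unlabeled data
    for _ in range(budget):
        if not U:
            break
        # Querying function: pick the pool point farthest from any labeled
        # point (a simple stand-in for "least certain").
        x = max(U, key=lambda u: abs(nearest(L, u)[0] - u))
        L.append((x, oracle(x)))   # request the label from the oracle
        U.remove(x)
    # Use the enlarged labeled set in a standard supervised way (1-NN here).
    return L, lambda x: nearest(L, x)[1]
```

Plotting the resulting classifier's accuracy against the number of queries made yields the learning curve mentioned above.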
Most active learning research focuses on how to choose the data tuples to be queried. Several frameworks have been proposed. Uncertainty sampling is the most common, where the active learner queries the tuples about which it is least certain how to label. Other strategies work to reduce the version space, that is, the subset
of all hypotheses that are consistent with the observed training tuples. Alternatively,
we may follow a decision-theoretic approach that estimates expected error reduction.
This selects tuples that would result in the greatest reduction in the total number of
 