Actively Learning for Labeling
The idea of active learning is to control what is presented to the annotators (or users
in the case of click-through log mining). In this subsection, we will introduce three
pieces of work. The first two are designed for active learning in human annotation,
and the last one is designed for active learning in click-through log mining.
In [33], the most ambiguous set of documents is selected for labeling. As we
know, the support vectors in Ranking SVM [20, 21] are the document pairs that lie
closest to the hyperplane of the model and are thus the most ambiguous for learning.
This sampling principle can therefore quickly identify the support vectors and
reduce the number of labeled documents needed to achieve high ranking accuracy.
The method is referred to as SVM selective sampling.
The basic idea of SVM selective sampling is as follows. Suppose that at the i-th round
we are going to select a set of documents S_i from the entire unlabeled dataset for
labeling. The criterion is that any two documents in the selected set should be
difficult to distinguish from each other (we denote this criterion by C(S_i) for ease of
reference). Finding a set according to this criterion is a typical combinatorial problem
and thus very difficult to solve. To tackle the challenge, a theorem is given in
[33] showing that a subset of the ranked list R that optimizes the selection criterion
C(S_i) must be consecutive. There thus exists an efficient way to find the optimal set
by searching only over the consecutive subsets of R. The overall complexity of the
proposed method is linear in the number of unlabeled documents.
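To illustrate why restricting the search to consecutive subsets makes the problem tractable, the following sketch (a simplified stand-in for the algorithm in [33]; the function name and the use of score spread as the ambiguity measure are assumptions for illustration) selects the k unlabeled documents whose ranking scores are closest together by scanning only consecutive windows of the score-sorted list:

```python
# Simplified sketch of SVM selective sampling: among all size-k subsets
# of unlabeled documents, a maximally ambiguous one (documents hardest
# to tell apart) can be found by scanning only consecutive windows of
# the score-sorted list, giving cost linear in the number of documents.

def select_ambiguous_set(scores, k):
    """Pick k documents whose model scores are closest together.

    scores : ranking scores f(x) for the unlabeled documents.
    k      : number of documents to select for labeling.
    Returns indices (into `scores`) of the selected documents.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    best_start, best_spread = 0, float("inf")
    # Only consecutive windows in the sorted order need to be examined.
    for start in range(len(scores) - k + 1):
        spread = scores[order[start + k - 1]] - scores[order[start]]
        if spread < best_spread:
            best_start, best_spread = start, spread
    return [order[i] for i in range(best_start, best_start + k)]

print(select_ambiguous_set([0.1, 0.9, 0.15, 0.5, 0.52, 0.2], 3))
# → [0, 2, 5]  (the three documents scored 0.1, 0.15, 0.2)
```

The sort dominates the cost here; the subsequent window scan itself is a single linear pass.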
In [12], a novel criterion is proposed for selecting documents for labeling. The
aim is to select a document x such that, when x is added to the training set with a
chosen label y, the model trained on the enlarged set has a lower expected error on
the test set than would result from labeling any other document. Since the label y is
unknown before labeling, an expectation is taken over all possible labels y to form
the final selection criterion.
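This criterion can be sketched as follows. The `train`, `error`, and `posterior` callables are hypothetical placeholders for the learner, the error estimator, and the current model's label posterior; they are not the exact machinery of [12]:

```python
# Hedged sketch of expected-error-reduction selection: for each
# candidate document x and each possible label y, retrain on the
# enlarged set and measure the resulting error; the expectation over y,
# weighted by the current model's posterior P(y | x), scores x.

def expected_error(candidate, labeled, pool, labels,
                   train, error, posterior):
    """Expected error after adding `candidate` with an unknown label.

    train(examples)    -> model fitted on (x, y) pairs   [assumed helper]
    error(model, pool) -> estimated error on a held-out pool
    posterior(x, y)    -> current model's estimate of P(y | x)
    """
    total = 0.0
    for y in labels:
        model = train(labeled + [(candidate, y)])
        total += posterior(candidate, y) * error(model, pool)
    return total

def select_document(candidates, labeled, pool, labels,
                    train, error, posterior):
    # Label the document whose addition minimizes the expected error.
    return min(candidates,
               key=lambda x: expected_error(x, labeled, pool, labels,
                                            train, error, posterior))
```

Note the cost that motivates the approximation discussed next: each candidate requires one retraining per possible label.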
The above criterion looks intuitive and reasonable; however, it is not feasible in
practice, since the expected error is very difficult to compute. Alternatively, a method
is proposed to estimate how likely the addition of a new document is to result in
the lowest expected error on the test set, without any retraining on the enlarged
training set. The basic idea is to investigate how likely a document is to change
the current hypothesis significantly. The rationale for using this likelihood lies in
the following aspects: (i) adding a new data point to the labeled set can change
the error on the test set only if it changes the current hypothesis; (ii) the more
significant the change, the greater the chance of learning the true hypothesis faster.
To realize the above idea, active sampling methods are designed for both Ranking
SVM [20, 21] and RankBoost [14]. Suppose the loss change with respect to the
addition of document x is D(w, x), and that minimizing this loss change results in a
new model parameter w. Then w is compared to the ranking model learned in the
previous round without document x: the larger the difference, the more likely x is to
be effective in changing the hypothesis. The methods have been tested on the
LETOR benchmark datasets, and promising results are obtained.
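A minimal sketch of this idea for a linear ranking model follows. It assumes a pairwise hinge loss and uses a single gradient step as a stand-in for the actual loss-change derivation in [12]; the function names and the uniform expectation over labels are illustrative assumptions:

```python
# Simplified model-change sampling for a linear ranking model w:
# approximate the parameter change caused by adding document x via one
# gradient step on the pairwise hinge loss that x induces against the
# labeled documents, and sample the document with the largest change.

def model_change(w, x, labeled, y, lr=0.1):
    """Norm of the one-step parameter update if x were labeled y."""
    grad = [0.0] * len(w)
    for xj, yj in labeled:
        if yj == y:          # same relevance grade: no preference pair
            continue
        if y > yj:           # x should be ranked above xj
            diff = [a - b for a, b in zip(x, xj)]
        else:                # xj should be ranked above x
            diff = [b - a for a, b in zip(x, xj)]
        margin = sum(wi * di for wi, di in zip(w, diff))
        if margin < 1.0:     # hinge violation contributes to the gradient
            grad = [g - d for g, d in zip(grad, diff)]
    step = [lr * g for g in grad]
    return sum(s * s for s in step) ** 0.5

def select_by_change(w, pool, labeled, labels):
    # Expected change over possible labels; uniform prior for simplicity.
    def score(x):
        return sum(model_change(w, x, labeled, y)
                   for y in labels) / len(labels)
    return max(pool, key=score)
```

Documents whose induced pairs already satisfy the margin leave w unchanged and are skipped, which matches the intuition above: only documents that would move the hypothesis are worth labeling.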
The limitation of this work is that one never knows whether the most significant
change to the original hypothesis is good or not before the document is labeled,