
13.3.1.2 Actively Learning for Labeling

The idea of active learning is to control what is presented to the annotators (or users in the case of click-through log mining). In this subsection, we will introduce three pieces of work. The first two are designed for active learning in human annotation, and the last one is designed for active learning in click-through log mining.

In [33], the most ambiguous set of documents is selected for labeling. As we know, the support vectors in Ranking SVM [20, 21] are the document pairs that are the closest to the hyperplane of the model and are thus the most ambiguous for learning. Therefore, this sampling principle can quickly identify the support vectors and reduce the total number of labeled documents needed to achieve a high ranking accuracy. The method is referred to as SVM selective sampling.

The basic idea of SVM selective sampling is as follows. Suppose that at the i-th round we are going to select a set of documents S_i from the entire unlabeled dataset for labeling. The criterion here is that any two documents in the selected set should be difficult to distinguish from each other (we denote this criterion as C(S_i) for ease of reference). Finding the set according to this criterion is a typical combinatorial problem and thus very difficult to solve. To tackle the challenge, a theorem is given in [33] showing that a subset of R that optimizes the selection criterion C(S_i) must be consecutive, and that there exists an efficient way to find the optimal set by searching only over the consecutive subsets in R. The overall complexity of the proposed method is linear in the number of unlabeled documents.
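As an illustration, here is a minimal sketch of such a selection step in Python. It assumes that R is the list of unlabeled documents sorted by the scores of the current model, and that documents with nearly identical scores are the ones that are difficult to distinguish; the window-based search and the function name below are our own simplification for illustration, not the exact procedure of [33].

    # Minimal sketch of SVM selective sampling (simplified illustration, not the exact
    # algorithm of [33]). Documents with very similar scores under the current model are
    # treated as hard to distinguish, so we scan consecutive windows of the ranked list R.

    def select_ambiguous_set(scores, k):
        """Return indices of the k documents whose model scores are closest to each other."""
        order = sorted(range(len(scores)), key=lambda i: scores[i])  # the ranked list R
        best_start, best_spread = 0, float("inf")
        # Only consecutive windows in R need to be examined (a single linear scan).
        for start in range(len(order) - k + 1):
            window = order[start:start + k]
            spread = scores[window[-1]] - scores[window[0]]  # score range in the window
            if spread < best_spread:
                best_start, best_spread = start, spread
        return order[best_start:best_start + k]

    # Toy usage: scores produced by the current ranking model for the unlabeled documents.
    scores = [0.9, 0.1, 0.15, 0.85, 0.5, 0.48, 0.52]
    print(select_ambiguous_set(scores, 3))  # -> the three most similar-scoring documents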

In [12], a novel criterion is proposed to select documents for labeling. That is, the aim is to select a document x such that, when it is added to the training set with a chosen label y, the model trained on the enlarged set has a smaller expected error on the test set than if any other document were labeled. Since the label y is unknown before labeling, an expectation is taken over all possible values of y as the final selection criterion.

The above criterion looks intuitive and reasonable; however, it is not feasible in practice, since it is very difficult to actually compute the expected error. As an alternative, a method is proposed to estimate how likely the addition of a new document is to result in the lowest expected error on the test set, without any re-training on the enlarged training set. The basic idea is to investigate how likely a document is to change the current hypothesis significantly. The rationale for using this likelihood lies in the following aspects: (i) adding a new data point to the labeled set can only change the error on the test set if it changes the current hypothesis; (ii) the more significant the change, the greater the chance of learning the true hypothesis faster.

To realize the above idea, active sampling methods are designed for both Ranking SVM [20, 21] and RankBoost [14]. Suppose that the loss change with respect to the addition of document x is D(w, x), and that minimizing this loss change results in a new model parameter w. Then w is compared to the ranking model learned in the previous round without document x. Basically, the larger the difference is, the more likely it is that x will be effective in changing the hypothesis. The methods have been tested on the LETOR benchmark datasets and promising results have been obtained.
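To make the idea more concrete, the following Python sketch (for a linear Ranking SVM, with documents represented as NumPy feature vectors) approximates the effect of adding a candidate document by the magnitude of the extra hinge-loss gradient its pairs would contribute, averaged over the possible labels. The gradient-norm proxy and all function names are our own simplification for illustration, not the exact D(w, x) computation of [12].

    # Sketch of loss-change-based active sampling for a linear Ranking SVM (simplified).
    # A candidate document x is paired with already labeled documents of different
    # (assumed) grades, and the hinge-loss (sub)gradient these pairs would add to the
    # objective is used as a proxy for how much x could change the current hypothesis.

    import numpy as np

    def hypothesis_change_score(w, x, labeled_docs, labeled_grades, assumed_grade):
        """Proxy for D(w, x): norm of the extra gradient contributed by document x."""
        grad = np.zeros_like(w)
        for d, g in zip(labeled_docs, labeled_grades):
            if g == assumed_grade:
                continue  # only pairs with different grades define a preference
            diff = (x - d) if assumed_grade > g else (d - x)  # preferred minus non-preferred
            if 1.0 - np.dot(w, diff) > 0.0:                   # the pair violates the margin
                grad -= diff                                   # hinge-loss (sub)gradient
        return np.linalg.norm(grad)

    def select_document(w, candidates, labeled_docs, labeled_grades, grades=(0, 1, 2)):
        """Pick the candidate expected to change the hypothesis the most, averaging the
        proxy over all possible grades, since the true label is unknown before labeling."""
        scores = [np.mean([hypothesis_change_score(w, x, labeled_docs, labeled_grades, g)
                           for g in grades])
                  for x in candidates]
        return int(np.argmax(scores))

Averaging over the possible grades mirrors the expectation over unknown labels in the selection criterion above.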

The limitation of the work is that one never knows whether the most significant change to the original hypothesis is good or not before the document is labeled,