is to reduce the version space as fast as possible, so as to reach the solution sooner and avoid certain costs associated with the problem. For the case of a nonsymmetric version space, more complex selection methods have been suggested by Tong and Koller [8], but the advantage of those methods is not significant, considering their high computational costs.
6.3.2.1 Active Learning with Small Pools The basic working principle of
margin-based AL with SVMs is: (i) train an SVM on the existing training data,
(ii) select the example closest to the hyperplane, and (iii) add the newly selected
example to the training set and train again. In classical AL [8], the search for the
most informative example is performed over the entire dataset. Note that each
iteration of AL involves recomputing every training example's distance
to the new hyperplane. Therefore, for large datasets, searching the entire training
set is a very time-consuming and computationally expensive task.
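This loop can be sketched in a few lines; the version below is only an illustration, assuming scikit-learn's SVC with a linear kernel and a pool whose labels are revealed one at a time to simulate the oracle (the function name and parameters are hypothetical, not from this chapter):

```python
import numpy as np
from sklearn.svm import SVC

def margin_based_active_learning(X_labeled, y_labeled, X_pool, y_pool, n_queries=50):
    """Margin-based AL with SVMs: repeatedly query the pool example closest to the hyperplane."""
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    pool_idx = np.arange(len(X_pool))                           # indices of still-unlabeled examples
    for _ in range(n_queries):
        clf = SVC(kernel="linear").fit(X_train, y_train)        # (i) train an SVM on the current training data
        dist = np.abs(clf.decision_function(X_pool[pool_idx]))  # distance of every pool example to the hyperplane
        best = pool_idx[np.argmin(dist)]                        # (ii) select the example closest to the hyperplane
        X_train = np.vstack([X_train, X_pool[best:best + 1]])   # (iii) add the newly selected example ...
        y_train = np.append(y_train, y_pool[best])              #       ... and its oracle-provided label
        pool_idx = pool_idx[pool_idx != best]                   # remove it from the unlabeled pool
    return SVC(kernel="linear").fit(X_train, y_train)           # final retraining on the enlarged set
```

Scanning the entire pool at every iteration is exactly the cost that the small-pool strategy described next avoids.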
One possible remedy for this performance bottleneck is to use the “59 trick”
[27], which avoids a full search through the entire dataset by approximating the
most informative examples from a small, constant number of randomly chosen
samples. The method picks L (L ≪ number of training examples) random training
samples in each iteration and selects the best (closest to the hyperplane) among
them. Suppose that, instead of picking the closest example among all the training
samples X_N = (x_1, x_2, ..., x_N) at each iteration, we first pick a random subset
X_L, L ≪ N, and select the closest sample x_i from X_L, subject to the condition
that x_i is among the top p% closest instances in X_N with probability (1 − η).
Any numerical requirement on these constraints can be met by varying the size
of L, independently of N. To see this, note that the probability that at least
one of the L instances is among the closest p fraction is 1 − (1 − p)^L. Owing to the
requirement of (1 − η) probability, we have

1 − (1 − p)^L = 1 − η    (6.6)

which gives the solution for L in terms of η and p:

L = log η / log(1 − p)    (6.7)
For example, the active learner will pick one example, with 95% probability, that
is among the top 5% closest instances to the hyperplane, by randomly sampling
only ⌈log(0.05)/log(0.95)⌉ = 59 examples regardless of the training set size.
This approach scales well because the size L of the subset is independent of the
training set size N; it requires significantly less training time and does not
adversely affect the classification performance of the learner.
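As a rough illustration of how the subset size follows from η and p, the sketch below replaces the full pool scan with a search over L random samples; it again assumes a scikit-learn classifier exposing decision_function, and the helper name and default values are hypothetical:

```python
import numpy as np

def small_pool_query(clf, X_pool, eta=0.05, p=0.05, rng=None):
    """Pick a query point via the '59 trick': examine only L random pool examples.

    With probability (1 - eta), the returned example is among the top p fraction
    of the pool closest to the hyperplane, no matter how large the pool is.
    """
    rng = np.random.default_rng() if rng is None else rng
    L = int(np.ceil(np.log(eta) / np.log(1.0 - p)))       # Eq. (6.7); eta = p = 0.05 gives L = 59
    L = min(L, len(X_pool))                                # never sample more examples than the pool holds
    subset = rng.choice(len(X_pool), size=L, replace=False)
    dist = np.abs(clf.decision_function(X_pool[subset]))   # distances of the L sampled examples
    return subset[np.argmin(dist)]                         # index of the closest sampled example
```

Swapping this subset search into the earlier loop leaves the rest of the procedure unchanged.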
6.3.3 Active Learning with Online Learning
Online learning algorithms are usually associated with problems where the complete
training set is not available. However, in cases where the complete training