first point has become too close to the separating hyperplane, even though it is properly classified.

We can see that in line (5) of Fig. 12.20, the problems with the first and third points are corrected, and we go back to the pattern oxoxxx of bad points. However, at line (6), the points have again become too close to the separating hyperplane, so we revert to the xxxxxx pattern of bad points. You are invited to continue the sequence of updates to w and b for several more iterations.
One might wonder why the gradient-descent process seems to be converging on a solution where at least some of the points are inside the margin, when there is an obvious hyperplane (horizontal, at height 1.5) with a margin of 1/2 that separates the positive and negative points. The reason is that by picking C = 0.1 we were saying that we do not care very much whether there are points inside the margin, or even whether some points are misclassified. We were also saying that what mattered was a large margin (which corresponds to a small ||w||), even if some points violated that margin.
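To make the role of C concrete, here is a minimal Python sketch, not taken from the text, of a regularized hinge-loss objective of the kind minimized in Section 12.3.4. The six training points and the two candidate hyperplanes are hypothetical; they merely mimic the situation described above, where the positive and negative points are separated by a horizontal line at height 1.5.

import numpy as np

def svm_objective(w, b, X, y, C):
    # f(w, b) = ||w||^2 / 2 + C * (sum of hinge losses). Each point contributes
    # max(0, 1 - y*(w.x + b)): zero if it is beyond the margin on the correct
    # side, positive if it is inside the margin or misclassified.
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Hypothetical data: positive points at heights 2 and 3, negative points at
# heights 0 and 1, so the horizontal line at height 1.5 separates them.
X = np.array([[1, 2], [2, 3], [3, 2], [1, 1], [2, 0], [3, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

separating = (np.array([0.0, 2.0]), -3.0)   # the hyperplane x2 = 1.5, margin 1/2
flat = (np.array([0.0, 0.5]), -0.75)        # a hyperplane with much smaller ||w||

for C in (0.1, 1.0):
    print(C, svm_objective(*separating, X, y, C), svm_objective(*flat, X, y, C))
# At C = 0.1 the flat hyperplane scores 0.475 versus 2.0, so it wins even though
# every point violates its margin; at C = 1.0 the separating hyperplane wins.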
12.3.5 Stochastic Gradient Descent
The gradient-descent algorithm described in Section 12.3.4 is often called batch gradient descent, because at each round all the training examples are considered as a single “batch.” While it is effective on small datasets, it can be too time-consuming to execute on a large dataset, where we must visit every training example, often many times, before convergence.

An alternative, called stochastic gradient descent, considers one training example, or a few training examples, at a time, and adjusts the current estimate of the minimizer of the error function (the vector w in the SVM example) in the direction indicated by only that small set of training examples. Additional rounds are possible, using other sets of training examples; these can be selected randomly or according to some fixed strategy. Note that it is normal for some members of the training set never to be used by a stochastic gradient descent algorithm.
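As a concrete sketch, not taken from the text, the following Python code contrasts one round of batch gradient descent with one stochastic round for the hinge-loss objective above. The toy data, the step size eta, and the uniform random sampling of a single example are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data (the same toy set as in the previous sketch).
X = np.array([[1, 2], [2, 3], [3, 2], [1, 1], [2, 0], [3, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

def batch_step(w, b, X, y, C, eta):
    # One round of batch gradient descent: the gradient sums over every example.
    viol = y * (X @ w + b) < 1                       # margin violators
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    return w - eta * grad_w, b - eta * grad_b

def stochastic_step(w, b, X, y, C, eta):
    # One stochastic round: the gradient is estimated from one random example.
    # Scaling that example's contribution by len(y) makes the estimate unbiased;
    # some formulations omit the scaling and fold it into eta instead.
    j = rng.integers(len(y))
    if y[j] * (X[j] @ w + b) < 1:
        grad_w, grad_b = w - C * len(y) * y[j] * X[j], -C * len(y) * y[j]
    else:
        grad_w, grad_b = w, 0.0
    return w - eta * grad_w, b - eta * grad_b

w, b = np.zeros(2), 0.0
for _ in range(1000):              # many cheap rounds, each touching one example
    w, b = stochastic_step(w, b, X, y, C=0.1, eta=0.01)

Notice that, as remarked above, nothing in the loop guarantees that every training example is ever selected.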
EXAMPLE 12.10 Recall the UV-decomposition algorithm discussed in Section 9.4.3. This algorithm was described as an example of batch gradient descent. We can regard each of the nonblank entries in the matrix M we are trying to approximate by the product UV as a training example, and the error function is the root-mean-square error between the product of the current matrices U and V and the matrix M, considering only those elements where M is nonblank.
However, if M has a very large number of nonblank entries, as would be the case if M represented, say, purchases of items by Amazon customers or movies that Netflix customers had rated, then it is not practical to make repeated passes over the entire set of nonblank entries of M when adjusting the entries in U and V. A stochastic gradient descent would instead consider only one, or a small subset, of the nonblank entries of M on each round, and adjust the entries of U and V based on those alone.
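A minimal Python sketch of such a stochastic update, again under assumptions not in the text: a small hypothetical matrix M with blanks marked as NaN, a rank-2 decomposition, a random initialization, and a fixed learning rate eta. Each step visits one randomly chosen nonblank entry (i, j) and adjusts only row i of U and column j of V to reduce that single entry's squared error.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-by-5 utility matrix; NaN marks a blank entry.
M = np.array([[5, 2, 4, 4, 3],
              [3, 1, 2, 4, 1],
              [2, np.nan, 3, 1, 4],
              [2, 5, 4, 3, 5],
              [4, 4, 5, 4, np.nan]], dtype=float)
d = 2                                         # rank of the decomposition
U = rng.normal(scale=0.5, size=(5, d))
V = rng.normal(scale=0.5, size=(d, 5))

nonblank = [(i, j) for i in range(5) for j in range(5) if not np.isnan(M[i, j])]

eta = 0.01                                    # assumed learning rate
for _ in range(20000):
    i, j = nonblank[rng.integers(len(nonblank))]   # one training example per step
    err = M[i, j] - U[i] @ V[:, j]                 # error at this single entry
    # Move row i of U and column j of V a little in the direction that shrinks
    # this entry's squared error; no other entries of U or V are touched.
    U[i], V[:, j] = U[i] + eta * err * V[:, j], V[:, j] + eta * err * U[i]

# The error function of Section 9.4.3: RMSE over the nonblank entries only.
pred = U @ V
print(np.sqrt(np.mean([(M[i, j] - pred[i, j]) ** 2 for i, j in nonblank])))

Because each step reads one entry of M and touches only one row of U and one column of V, the cost per step does not depend on how many nonblank entries M has.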