first point has become too close to the separating hyperplane, even though it is properly classified.

We can see that in line (5) of Fig. 12.20, the problems with the first and third points are corrected, and we go back to the pattern oxoxxx of bad points. However, at line (6), the points have again become too close to the separating hyperplane, so we revert to the xxxxxx pattern of bad points. You are invited to continue the sequence of updates to w and b for several more iterations.
One might wonder why the gradient-descent process seems to be converging on a solution where at least some of the points are inside the margin, when there is an obvious hyperplane (horizontal, at height 1.5) with a margin of 1/2 that separates the positive and negative points. The reason is that by picking C = 0.1 we were saying that we do not care very much whether there are points inside the margin, or even whether some points are misclassified. We were also saying that what mattered was a large margin (which corresponds to a small ||w||), even if some points violated that margin.
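To make the role of C concrete, here is a minimal Python sketch, not taken from the text, of a regularized hinge-loss objective of the kind minimized in Section 12.3.4. The six training points and the two candidate hyperplanes are hypothetical; they merely mimic the situation described above, where the positive and negative points are separated by a horizontal line at height 1.5.

import numpy as np

def svm_objective(w, b, X, y, C):
    # f(w, b) = ||w||^2 / 2 + C * (sum of hinge losses). Each point contributes
    # max(0, 1 - y*(w.x + b)): zero if it is beyond the margin on the correct
    # side, positive if it is inside the margin or misclassified.
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Hypothetical data: positive points at heights 2 and 3, negative points at
# heights 0 and 1, so the horizontal line at height 1.5 separates them.
X = np.array([[1, 2], [2, 3], [3, 2], [1, 1], [2, 0], [3, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

separating = (np.array([0.0, 2.0]), -3.0)   # the hyperplane x2 = 1.5, margin 1/2
flat = (np.array([0.0, 0.5]), -0.75)        # a hyperplane with much smaller ||w||

for C in (0.1, 1.0):
    print(C, svm_objective(*separating, X, y, C), svm_objective(*flat, X, y, C))
# At C = 0.1 the flat hyperplane scores 0.475 versus 2.0, so it wins even though
# every point violates its margin; at C = 1.0 the separating hyperplane wins.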
12.3.5 Stochastic Gradient Descent
The gradient-descent algorithm described in Section 12.3.4 is often called batch gradient descent, because at each round all the training examples are considered as a single “batch.” While it is effective on small datasets, it can be too time-consuming to execute on a large dataset, where we must visit every training example, often many times, before convergence.

An alternative, called stochastic gradient descent, considers one training example, or a few training examples, at a time, and adjusts the current estimate of the minimizer of the error function (the vector w in the SVM example) in the direction indicated by only that small set of training examples. Additional rounds are possible, using other sets of training examples; these can be selected randomly or according to some fixed strategy. Note that it is normal for some members of the training set never to be used by a stochastic gradient descent algorithm.
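As a concrete sketch, not taken from the text, the following Python code contrasts one round of batch gradient descent with one stochastic round for the hinge-loss objective above. The toy data, the step size eta, and the uniform random sampling of a single example are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data (the same toy set as in the previous sketch).
X = np.array([[1, 2], [2, 3], [3, 2], [1, 1], [2, 0], [3, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

def batch_step(w, b, X, y, C, eta):
    # One round of batch gradient descent: the gradient sums over every example.
    viol = y * (X @ w + b) < 1                       # margin violators
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    return w - eta * grad_w, b - eta * grad_b

def stochastic_step(w, b, X, y, C, eta):
    # One stochastic round: the gradient is estimated from one random example.
    # Scaling that example's contribution by len(y) makes the estimate unbiased;
    # some formulations omit the scaling and fold it into eta instead.
    j = rng.integers(len(y))
    if y[j] * (X[j] @ w + b) < 1:
        grad_w, grad_b = w - C * len(y) * y[j] * X[j], -C * len(y) * y[j]
    else:
        grad_w, grad_b = w, 0.0
    return w - eta * grad_w, b - eta * grad_b

w, b = np.zeros(2), 0.0
for _ in range(1000):              # many cheap rounds, each touching one example
    w, b = stochastic_step(w, b, X, y, C=0.1, eta=0.01)

Notice that, as remarked above, nothing in the loop guarantees that every training example is ever selected.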
EXAMPLE 12.10 Recall the UV-decomposition algorithm discussed in Section 9.4.3. This algorithm was described as an example of batch gradient descent. We can regard each of the nonblank entries in the matrix M we are trying to approximate by the product UV as a training example, and the error function is the root-mean-square error between the product of the current matrices U and V and the matrix M, considering only those elements where M is nonblank.
However, if M has a very large number of nonblank entries, as would be the case if M represented, say, purchases of items by Amazon customers or movies that Netflix customers had rated, then it is not practical to make repeated passes over the entire set of nonblank entries of M when adjusting the entries in U and V. A stochastic gradient descent would instead consider only one, or a small subset, of the nonblank entries of M on each round, and adjust the entries of U and V based on those alone.
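A minimal Python sketch of such a stochastic update, again under assumptions not in the text: a small hypothetical matrix M with blanks marked as NaN, a rank-2 decomposition, a random initialization, and a fixed learning rate eta. Each step visits one randomly chosen nonblank entry (i, j) and adjusts only row i of U and column j of V to reduce that single entry's squared error.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-by-5 utility matrix; NaN marks a blank entry.
M = np.array([[5, 2, 4, 4, 3],
              [3, 1, 2, 4, 1],
              [2, np.nan, 3, 1, 4],
              [2, 5, 4, 3, 5],
              [4, 4, 5, 4, np.nan]], dtype=float)
d = 2                                         # rank of the decomposition
U = rng.normal(scale=0.5, size=(5, d))
V = rng.normal(scale=0.5, size=(d, 5))

nonblank = [(i, j) for i in range(5) for j in range(5) if not np.isnan(M[i, j])]

eta = 0.01                                    # assumed learning rate
for _ in range(20000):
    i, j = nonblank[rng.integers(len(nonblank))]   # one training example per step
    err = M[i, j] - U[i] @ V[:, j]                 # error at this single entry
    # Move row i of U and column j of V a little in the direction that shrinks
    # this entry's squared error; no other entries of U or V are touched.
    U[i], V[:, j] = U[i] + eta * err * V[:, j], V[:, j] + eta * err * U[i]

# The error function of Section 9.4.3: RMSE over the nonblank entries only.
pred = U @ V
print(np.sqrt(np.mean([(M[i, j] - pred[i, j]) ** 2 for i, j in nonblank])))

Because each step reads one entry of M and touches only one row of U and one column of V, the cost per step does not depend on how many nonblank entries M has.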