. . . , x_{id}]. Also, as before, w = [w_1, w_2, . . . , w_d]. Note that the two summations express the dot product of vectors.
The constant C, called the regularization parameter, reflects how important misclassification is. Pick a large C if you really do not want to misclassify points, even at the cost of a narrow margin. Pick a small C if you can tolerate some misclassified points but want most of the points to be far from the boundary (i.e., you want a large margin).
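As a minimal sketch of this trade-off (the synthetic data and candidate weight vectors here are our own illustration, not from the book), consider the objective f(w, b) = ||w||^2/2 + C Σ_i L(x_i, y_i) evaluated for two choices of w, one with a small norm but some hinge loss and one with a larger norm but zero hinge loss:

import numpy as np

def objective(w, b, X, y, C):
    z = y * (X @ w + b)                 # z_i = y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - z)    # hinge loss L(x_i, y_i)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_wide   = np.array([0.2, 0.2])   # small ||w|| (wide margin), some hinge loss
w_strict = np.array([0.5, 0.5])   # larger ||w||, every z_i >= 1 (zero loss)

for C in (0.01, 1.0):
    fa = objective(w_wide, 0.0, X, y, C)
    fb = objective(w_strict, 0.0, X, y, C)
    print(f"C={C}: wide-margin f={fa:.3f}, strict f={fb:.3f}")

With C = 0.01 the wide-margin w wins (f = 0.055 versus 0.25); with C = 1 the zero-loss w wins (f = 0.25 versus 1.54), matching the intuition above.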
We must explain the penalty function (the second term) in Equation 12.4. The summation over i has one term L(x_i, y_i) for each training example x_i. The quantity L is a hinge function, suggested in Fig. 12.17, and we call its value the hinge loss. Let z_i = y_i(w · x_i + b) = y_i(Σ_{j=1}^{d} w_j x_{ij} + b). When z_i is 1 or more, the value of L is 0. But for smaller values of z_i, L rises linearly as z_i decreases; that is, L(x_i, y_i) = max{0, 1 − z_i}.
Figure 12.17 The hinge function decreases linearly for z ≤ 1 and then remains 0
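As a quick numeric check of the shape in Fig. 12.17 (a small sketch; the function name is ours, not the book's), we can evaluate the hinge function at a few points:

def hinge(z):
    # Hinge function of Fig. 12.17: rises linearly as z decreases below 1,
    # and is 0 for all z >= 1.
    return max(0.0, 1.0 - z)

for z in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"L({z}) = {hinge(z)}")   # prints 2.0, 1.0, 0.5, 0.0, 0.0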
Since we shall need to take the derivative of L(x_i, y_i) with respect to each w_j, note that the derivative of the hinge function is discontinuous. It is −y_i x_{ij} for z_i < 1 and 0 for z_i > 1. That is, if y_i = +1 (i.e., the i-th training example is positive), then

    ∂L/∂w_j = 0 if Σ_{j=1}^{d} w_j x_{ij} + b ≥ 1, and −x_{ij} otherwise.

Moreover, if y_i = −1 (i.e., the i-th training example is negative), then

    ∂L/∂w_j = 0 if Σ_{j=1}^{d} w_j x_{ij} + b ≤ −1, and x_{ij} otherwise.

The two cases can be summarized as one, if we include the value of y_i, as:

    ∂L/∂w_j = 0 if y_i(Σ_{j=1}^{d} w_j x_{ij} + b) ≥ 1, and −y_i x_{ij} otherwise.    (12.5)
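The piecewise rule of Equation 12.5 translates directly into code. The following sketch (the helper name and the tiny data set are illustrative, not from the book) computes ∂L/∂w_j for every training example at once; at z_i = 1, where the derivative is undefined, it follows the common convention of treating the hinge as inactive:

import numpy as np

def hinge_gradient(w, b, X, y):
    # Row i holds dL/dw_j for example x_i, per Equation 12.5:
    # 0 where y_i (w . x_i + b) >= 1, and -y_i * x_ij otherwise.
    z = y * (X @ w + b)
    active = (z < 1.0).astype(float)     # 1 where the hinge loss is positive
    return -(active * y)[:, None] * X

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.zeros(2), 0.0

print(hinge_gradient(w, b, X, y))   # with w = 0, every z_i = 0 < 1,
                                    # so row i is simply -y_i * x_i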