[1] A. Blum, "Empirical support for winnow and weighted-majority algorithms: results on a calendar scheduling domain," Machine Learning 26 (1997), pp. 5-23.
[2] L. Bottou, "Large-scale machine learning with stochastic gradient descent," Proc. 19th Intl. Conf. on Computational Statistics (2010), pp. 177-187, Springer.
[3] L. Bottou, "Stochastic gradient tricks," in Neural Networks: Tricks of the Trade, Reloaded, pp. 430-445, edited by G. Montavon, G.B. Orr and K.-R. Mueller, Lecture Notes in Computer Science (LNCS 7700), Springer, 2012.
[4] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2 (1998), pp. 121-167.
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[6] C. Cortes and V.N. Vapnik, "Support-vector networks," Machine Learning 20 (1995), pp. 273-297.
[7] Y. Freund and R.E. Schapire, "Large margin classification using the perceptron algorithm," Machine Learning 37 (1999), pp. 277-296.
[8] T. Joachims, "Training linear SVMs in linear time," Proc. 12th ACM SIGKDD (2006), pp. 217-226.
[9] N. Littlestone, "Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm," Machine Learning 2 (1988), pp. 285-318.
[10] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry (2nd edition), MIT Press, Cambridge MA, 1972.
[11] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review 65:6 (1958), pp. 386-408.
1. The constant b in this formulation of a hyperplane is the same as the negative of the threshold θ in our treatment of perceptrons in Section 12.2.
2. Note, however, that d there has become d + 1 here, since we include b as one of the components of w when taking the derivative.
3. While the region belonging to any one point is convex, the union of the regions for two or more points might not be convex. Thus, in Fig. 12.21 we see that the region for all Dachshunds and the region for all Beagles are not convex. That is, there are points p1 and p2 that are both classified as Dachshunds, but the midpoint of the line between p1 and p2 is classified as a Beagle, and vice versa.
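The non-convexity described in footnote 3 is easy to reproduce with a tiny 1-nearest-neighbor sketch. The coordinates and labels below are invented for illustration (they are not taken from Fig. 12.21): any configuration in which a point of one class lies between two points of another class exhibits the same effect.

```python
# 1-NN sketch: the union of the regions for one class need not be convex.
# Hypothetical training points: two "Dachshund" points with a "Beagle"
# point sitting between (and slightly above) them.
train = [((0.0, 0.0), "Dachshund"),
         ((4.0, 0.0), "Dachshund"),
         ((2.0, 1.0), "Beagle")]

def classify(q):
    """Label q by its nearest training point (squared Euclidean distance)."""
    return min(train,
               key=lambda t: (t[0][0] - q[0])**2 + (t[0][1] - q[1])**2)[1]

p1, p2 = (0.0, 0.0), (4.0, 0.0)
mid = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)   # midpoint (2.0, 0.0)

print(classify(p1), classify(p2))  # both classified as Dachshund
print(classify(mid))               # Beagle: the Dachshund region is not convex
```

The midpoint of the segment between the two Dachshund points is closer to the Beagle point than to either endpoint, so it is classified as a Beagle even though both endpoints are Dachshunds.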