Test Sets and Overfitting : When training some classifier on a training set, it is useful to remove some of the training
set and use the removed data as a test set. After producing a model or classifier without using the test set, we can run
the classifier on the test set to see how well it does. If the classifier does not perform as well on the test set as on the
training set used, then we have overfit the training set by conforming to peculiarities of the training-set data that are not present in the data as a whole.
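As a concrete illustration (not from the text), one simple way to carve out a test set is to hold back a random fraction of the labeled examples; the function name and the 20% split below are arbitrary choices.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Hold back a random fraction of the labeled data to serve as a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = order[:n_test], order[n_test:]
    return X[train], y[train], X[test], y[test]

# Train a classifier on (X_train, y_train) only, then compare its accuracy on
# the training set with its accuracy on (X_test, y_test); a noticeably worse
# score on the test set is the signature of overfitting.
```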
Batch Versus On-Line Learning : In batch learning, the training set is available at any time and can be used in re-
peated passes. On-line learning uses a stream of training examples, each of which can be used only once.
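The contrast is easy to state in code; the sketch below is schematic, with `update` standing in for whatever per-example adjustment the learning method makes (for the perceptron, the rule described next).

```python
def batch_learning(model, examples, update, passes=10):
    # Batch: the whole training set is on hand, so we may sweep over it repeatedly.
    for _ in range(passes):
        for x, y in examples:
            model = update(model, x, y)
    return model

def online_learning(model, stream, update):
    # On-line: examples arrive as a stream and each one is used exactly once.
    for x, y in stream:
        model = update(model, x, y)
    return model
```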
Perceptrons : This machine-learning method assumes the training set has only two class labels, positive and negative.
Perceptrons work when there is a hyperplane that separates the feature vectors of the positive examples from those
of the negative examples. We converge to that hyperplane by repeatedly adjusting our estimate of the hyperplane by a fraction, called the learning rate, of the direction given by the average of the currently misclassified points.
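A minimal sketch of this rule, assuming labels are +1 and -1 and the threshold is absorbed into the weight vector as a constant bias component; the learning rate and iteration limit below are arbitrary.

```python
import numpy as np

def perceptron(X, y, learning_rate=0.1, max_steps=1000):
    """X: rows are feature vectors (include a constant 1 component for the bias).
    y: labels +1 or -1.  Returns a weight vector w defining the hyperplane w.x = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        wrong = (X @ w) * y <= 0                 # currently misclassified points
        if not wrong.any():
            break                                # every point is on the correct side
        # Move w by a fraction of the average direction of the misclassified points.
        w += learning_rate * (X[wrong] * y[wrong][:, None]).mean(axis=0)
    return w
```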
The Winnow Algorithm : This algorithm is a variant of the perceptron algorithm that requires components of the fea-
ture vectors to be 0 or 1. Training examples are examined in a round-robin fashion, and if the current classification
of a training example is incorrect, the components of the estimated separator where the feature vector has 1 are ad-
justed up or down, in the direction that will make it more likely this training example is correctly classified in the
next round.
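A sketch of one common version of Winnow, with labels +1 and -1, all weights starting at 1, the threshold set to the number of features, and promotion or demotion by a factor of 2; these particular constants are conventional choices rather than requirements.

```python
import numpy as np

def winnow(X, y, factor=2.0, max_rounds=100):
    """X: rows are 0/1 feature vectors.  y: labels +1 or -1.
    Predict +1 when w.x exceeds the threshold theta."""
    d = X.shape[1]
    w = np.ones(d)
    theta = float(d)
    for _ in range(max_rounds):
        mistakes = 0
        for x, label in zip(X, y):               # round-robin over the training set
            prediction = 1 if w @ x > theta else -1
            if prediction != label:
                mistakes += 1
                if label > 0:
                    w[x == 1] *= factor          # promote weights where the feature is 1
                else:
                    w[x == 1] /= factor          # demote weights where the feature is 1
        if mistakes == 0:
            break
    return w, theta
```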
Nonlinear Separators : When no linear function separates the two classes of training points, it may still be possible to use a perceptron to classify them. We must find a function that transforms the points so that
in the transformed space, the separator is a hyperplane.
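For example, points labeled by whether they fall inside a circle centered at the origin cannot be separated by a line, but adding the squared radius as an extra coordinate makes them separable by a plane; the specific mapping below is just one illustrative choice.

```python
import numpy as np

def add_radius_squared(X):
    """Map each 2-D point (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    r2 = (X ** 2).sum(axis=1, keepdims=True)
    return np.hstack([X, r2])

# A perceptron trained on add_radius_squared(X) can now separate inside-the-circle
# points from outside-the-circle points with a plane in the 3-D space.
```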
Support-Vector Machines : The SVM improves upon perceptrons by finding a separating hyperplane that not only
separates the positive and negative points, but does so in a way that maximizes the margin, the perpendicular distance from the hyperplane to the nearest points. The points that lie exactly at this minimum distance are the support
vectors. Alternatively, the SVM can be designed to allow points that are too close to the hyperplane, or even on the
wrong side of the hyperplane, but minimize the error due to such misplaced points.
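As an outside illustration, scikit-learn's SVC with a linear kernel fits such a maximal-margin separator and exposes the support vectors it found; the tiny data set and the value of C below are made up for the example.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # positive examples
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])  # negative examples
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print(clf.coef_, clf.intercept_)      # the separating hyperplane w.x + b = 0
print(clf.support_vectors_)           # the points lying closest to the hyperplane
```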
Solving the SVM Equations : We can set up a function of the vector that is normal to the hyperplane, the length of the
vector (which determines the margin), and the penalty for points on the wrong side of the margins. The regulariza-
tion parameter determines the relative importance of a wide margin and a small penalty. The equations can be solved
by several methods, including gradient descent and quadratic programming.
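A minimal numpy sketch of the gradient-descent route, using the usual hinge-loss form of the objective, 0.5*||w||^2 + C * sum(max(0, 1 - y(w.x + b))); the step size and step count are arbitrary.

```python
import numpy as np

def svm_gradient_descent(X, y, C=1.0, learning_rate=0.01, steps=1000):
    """Minimize 0.5*||w||**2 + C * sum of hinge penalties by gradient descent.
    C is the regularization parameter trading margin width against penalty."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        viol = margins < 1                        # points inside the margin or misclassified
        grad_w = w - C * (X[viol] * y[viol][:, None]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```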
Nearest-Neighbor Learning : In this approach to machine learning, the entire training set is used as the model. For
each (“query”) point to be classified, we search for its k nearest neighbors in the training set. The classification of
the query point is some function of the labels of these k neighbors. The simplest case is when k = 1, in which case
we can take the label of the query point to be the label of the nearest neighbor.
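A minimal sketch, assuming Euclidean distance and a majority vote as the "some function of the labels"; with k = 1 it reduces to copying the nearest neighbor's label.

```python
import numpy as np
from collections import Counter

def knn_classify(query, X, labels, k=3):
    """Label the query point by a majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```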
Regression : A common case of nearest-neighbor learning, called regression, occurs when there is only one feature, and it, as well as the label, are real numbers; i.e., the data defines a real-valued function of one variable.
To estimate the label, i.e., the value of the function, for an unlabeled data point, we can perform some computation
involving the k nearest neighbors. Examples include averaging the neighbors or taking a weighted average, where
the weight of a neighbor is some decreasing function of its distance from the point whose label we are trying to
determine.
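A sketch of the weighted-average variant, using the inverse of the distance as the decreasing weight function; both the choice of k and the particular weighting are illustrative.

```python
import numpy as np

def knn_regress(x0, xs, ys, k=3):
    """Estimate the function value at x0 from the k nearest training points,
    weighting each neighbor by the inverse of its distance from x0."""
    d = np.abs(xs - x0)
    nearest = np.argsort(d)[:k]
    weights = 1.0 / (d[nearest] + 1e-9)           # nearer neighbors count more
    return float((weights * ys[nearest]).sum() / weights.sum())
```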
12.7 References for Chapter 12
The perceptron was introduced in [11]. [7] introduces the idea of maximizing the margin around the separating hyperplane. A well-known book on the subject is [10].
The Winnow algorithm is from [9]. Also see the analysis in [1].