Figure 12.13 Perceptrons converge as soon as the separating hyperplane reaches the region between classes
Perceptrons on Streaming Data
While we have viewed the training set as stored data, available for repeated use on any number of passes, Perceptrons
can also be used in a stream setting. That is, we may suppose there is an infinite sequence of training examples, but
that each may be used only once. Detecting email spam is a good example of a training stream. Users report spam
emails and also report emails that were classified as spam but are not. Each email, as it arrives, is treated as a training
example, and modifies the current weight vector, presumably by a very small amount.
If the training set is a stream, we never really converge, and in fact the data points may well not be linearly separable. However, at all times, we have an approximation to the best possible separator. Moreover, if the examples in the stream evolve over time, as would be the case for email spam, then we have an approximation that values recent examples more than examples from the distant past, much like the exponentially decaying windows technique from Section 4.7.
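To make the streaming update concrete, here is a minimal sketch in Python, assuming examples arrive one at a time as (x, y) pairs with y equal to +1 or −1; the function name stream_train, the dense-list representation of x, and the learning rate eta are illustrative choices, not details fixed by the text.

def stream_train(examples, dim, eta=0.01):
    """Maintain an approximate separator w over an endless stream of (x, y) pairs."""
    w = [0.0] * dim
    for x, y in examples:                       # each example is seen exactly once
        dot = sum(wi * xi for wi, xi in zip(w, x))
        if y * dot <= 0:                        # misclassified (or on the boundary)
            for i in range(dim):
                w[i] += eta * y * x[i]          # nudge w by a small amount
        yield w                                 # current approximation after each example

Because eta is small, each example moves w only slightly, so the separator gradually tracks drift in the stream, such as changing spam vocabulary, rather than being dominated by any single example.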
12.2.8 Parallel Implementation of Perceptrons
The training of a perceptron is an inherently sequential process. If the number of dimensions of the vectors involved is huge, then we might obtain some parallelism by computing dot products in parallel. However, as we discussed in connection with Example 12.4, high-dimensional vectors are likely to be sparse and can be represented more succinctly than would be expected from their length.
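As an illustration of that succinct representation (one common choice, not one prescribed by the text), a sparse vector can be stored as a dictionary of its nonzero components, and the dot product then touches only those components.

def sparse_dot(w, x_sparse):
    """Dot product of a dense weight list w with a sparse vector x (index -> nonzero value)."""
    return sum(w[i] * xi for i, xi in x_sparse.items())

w = [0.0] * 1_000_000                          # dense weight vector
x = {12: 1.0, 40517: 2.5, 999999: -1.0}        # nominal length 1,000,000; only 3 nonzeros
print(sparse_dot(w, x))                        # work proportional to 3, not 1,000,000

With so little work per example, parallelizing individual dot products buys almost nothing, which is why the algorithm itself must be reorganized to obtain useful parallelism.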
In order to get significant parallelism, we have to modify the perceptron algorithm slightly, so that many training examples are used with the same estimated weight vector w. As an example, let us formulate the parallel algorithm as a MapReduce job.
The Map Function: Each Map task is given a chunk of training examples, and each Map task knows the current weight vector w. The Map task computes w.x for each feature vector x = [x1, x2, . . . , xk] in its chunk and compares that dot product with the label y, which is +1 or −1, associated with x. If the signs agree, no key-value pairs are produced for this training example. However, if the signs disagree, then for each nonzero component xi of x the key-value pair (i, ηyxi) is produced; here, η is the learning-rate constant used to train this perceptron. Notice that ηyxi is the increment we would like to add to the current ith component of w, and if xi = 0, then there is no need to produce a key-value pair. However, rather than applying each increment immediately, the increments for each key i are summed in the Reduce phase, and each sum is then added to the corresponding component of w.
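A short Python sketch of this Map step, together with a per-key Reduce step, follows; the function names perceptron_map and perceptron_reduce, the sparse-dict representation of x, and the treatment of w.x = 0 as a disagreement are assumptions made here for illustration, not details fixed by the text.

def perceptron_map(chunk, w, eta=0.01):
    """Emit (i, eta*y*x_i) for each nonzero component of every misclassified example."""
    pairs = []
    for x, y in chunk:                                  # x: sparse dict index -> value, y: +1 or -1
        dot = sum(w[i] * xi for i, xi in x.items())     # w.x using only the nonzero x_i
        if y * dot <= 0:                                # sign of w.x disagrees with the label
            for i, xi in x.items():
                pairs.append((i, eta * y * xi))         # increment we would like to add to w[i]
        # if the signs agree, nothing is emitted for this example
    return pairs


def perceptron_reduce(i, increments, w):
    """Sum all increments produced for component i and apply them to w."""
    w[i] += sum(increments)

Deferring the updates in this way lets every Map task work on its chunk with the same w, which is exactly the modification needed to expose parallelism.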