Real-time Machine Learning with Spark Streaming - Machine Learning with Spark

Streaming regression

Spark provides a built-in streaming machine learning model in the StreamingLinearAlgorithm class. Currently, only a linear regression implementation is available (StreamingLinearRegressionWithSGD), but future versions will include classification.

The streaming regression model provides two methods:

• trainOn : This takes a DStream[LabeledPoint] as its argument and tells the model to train on every batch in the input DStream. It can be called multiple times to train on different streams.

• predictOn : This also takes a DStream[LabeledPoint] . It tells the model to make predictions on the input DStream, returning a new DStream[Double] that contains the model predictions.

Under the hood, the streaming regression model uses foreachRDD and map to accomplish this. It also updates the model variable after each batch and exposes the latest trained model, which allows us to use this model in other applications or save it to an external location.
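
The per-batch update can be pictured with a small conceptual sketch. It is not Spark's implementation: plain Scala collections stand in for the batches of a DStream, and the names updateOnBatch, trainOnBatches, and predict are hypothetical helpers chosen for illustration. The point it shows is that each batch starts from the weights left by the previous one, and that prediction is a map over the current weights.

```scala
// Conceptual sketch only: plain Scala, no Spark dependency.
object StreamingSgdSketch {
  type Point = (Double, Array[Double]) // (label, features)

  // Dot product of the weights with one feature vector.
  def predict(weights: Array[Double], features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum

  // One gradient-descent step on squared error, averaged over a single batch.
  def updateOnBatch(weights: Array[Double], batch: Seq[Point], stepSize: Double): Array[Double] =
    if (batch.isEmpty) weights // an empty batch leaves the model unchanged
    else {
      val grad = Array.fill(weights.length)(0.0)
      for ((label, features) <- batch) {
        val error = predict(weights, features) - label
        for (i <- grad.indices) grad(i) += error * features(i)
      }
      weights.indices.map(i => weights(i) - stepSize * grad(i) / batch.size).toArray
    }

  // Analogue of training on every batch: the weights are carried forward.
  def trainOnBatches(initial: Array[Double], batches: Seq[Seq[Point]], stepSize: Double): Array[Double] =
    batches.foldLeft(initial)((w, b) => updateOnBatch(w, b, stepSize))
}
```

Feeding the same small batch repeatedly, for example Seq((2.0, Array(1.0)), (4.0, Array(2.0))), moves a single weight from 0.0 toward 2.0, which mirrors how the streaming model keeps refining one shared set of weights as batches arrive.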

The streaming regression model can be configured with parameters for step size and number of iterations in the same way as standard batch regression; the model class used is the same. We can also set the initial model weight vector.
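
As a minimal sketch of how configuration, training, and prediction fit together, the following program builds a model and attaches it to a stream. Several details are assumptions made for illustration: the socket source on localhost:9999, the 10-second batch interval, three features, the parameter values, and the text format read by LabeledPoint.parse. The sketch also assumes Spark 1.2 or later, where predictOn accepts feature vectors, so each labeled point is mapped to its features first; with the signature described above, the labeled stream would be passed directly.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingRegressionSketch {
  // Step size and iteration count are set as for batch regression.
  // The values are illustrative, not tuned.
  def buildModel(numFeatures: Int): StreamingLinearRegressionWithSGD =
    new StreamingLinearRegressionWithSGD()
      .setStepSize(0.01)
      .setNumIterations(1)
      .setInitialWeights(Vectors.zeros(numFeatures))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingRegressionSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed input: one labeled point per line, e.g. "(1.0,[0.5,0.2,0.1])".
    val labeled = ssc.socketTextStream("localhost", 9999).map(LabeledPoint.parse)

    val model = buildModel(numFeatures = 3)
    model.trainOn(labeled)                           // update the model on every batch
    model.predictOn(labeled.map(_.features)).print() // print a few predictions per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```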

When we first start training a model, we can set the initial weights to a zero vector or a random vector, or load the latest model produced by an offline batch process. We can also save the model periodically to an external system and use the latest saved state as the starting point (for example, when restarting after a node or application failure).
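
These options might be sketched as follows. The helper names (zeroWeights, randomWeights, savedWeights, saveEachBatch), the 0.01 scale of the random weights, and the directory layout are hypothetical. The sketch assumes Spark 1.3 or later, where LinearRegressionModel has save and load methods. It saves after every batch only to keep the example short; a real job would likely save less often.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, StreamingLinearRegressionWithSGD}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream
import scala.util.{Random, Try}

object InitialWeightsSketch {
  // Option 1: start from all zeros.
  def zeroWeights(n: Int): Vector = Vectors.zeros(n)

  // Option 2: start from small random values (the scale is an arbitrary choice).
  def randomWeights(n: Int, seed: Long): Vector = {
    val rng = new Random(seed)
    Vectors.dense(Array.fill(n)(rng.nextGaussian() * 0.01))
  }

  // Option 3: reuse the weights of a previously saved model.
  // Returns None if nothing can be loaded from `path`.
  def savedWeights(sc: SparkContext, path: String): Option[Vector] =
    Try(LinearRegressionModel.load(sc, path).weights).toOption

  // Writes the latest model to a new, timestamped directory for each batch.
  // Register this after trainOn so the saved model includes that batch's update.
  def saveEachBatch(model: StreamingLinearRegressionWithSGD,
                    stream: DStream[LabeledPoint],
                    baseDir: String): Unit =
    stream.foreachRDD { (rdd: RDD[LabeledPoint], time: Time) =>
      model.latestModel().save(rdd.sparkContext, s"$baseDir/model-${time.milliseconds}")
    }
}
```

On restart, a call such as savedWeights(sc, path).getOrElse(zeroWeights(n)) could be passed to setInitialWeights, so the model resumes from the last saved state when one exists and falls back to zeros otherwise.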
