Streaming regression
Spark provides a built-in streaming machine learning model through the abstract
StreamingLinearAlgorithm class. Currently, only a linear regression implementation,
StreamingLinearRegressionWithSGD, is available, but future versions will include
classification.
The streaming regression model provides two methods:
trainOn: This takes DStream[LabeledPoint] as its argument and tells the model to
train on every batch in the input DStream. It can be called multiple times to train
on different streams.
predictOn: This takes DStream[Vector], that is, feature vectors without labels. It
tells the model to make predictions on the input DStream, returning a new
DStream[Double] that contains the model predictions.
Under the hood, the streaming regression model uses foreachRDD and map to accomplish
this. It also updates the model variable after each batch and exposes the latest trained
model, which allows us to use this model in other applications or save it to an external
location.
The streaming regression model can be configured with parameters for step size and
number of iterations in the same way as standard batch regression, since the model class
used is the same. We can also set the initial model weight vector.
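The per-batch behavior described above can be illustrated with a small sketch in plain Python. It is not Spark's implementation or API: the class ToyStreamingRegression and its methods train_on_batch and predict_on_batch are hypothetical names, the "stream" is an ordinary list of batches, and the update is plain full-batch gradient descent on squared error. The sketch only shows how weights carry over from one batch to the next and where step size, number of iterations, and initial weights enter.

```python
from typing import List, Sequence, Tuple

Batch = List[Tuple[float, List[float]]]  # (label, features) pairs


class ToyStreamingRegression:
    """Hypothetical stand-in for a streaming linear model (not Spark's API)."""

    def __init__(self, initial_weights: Sequence[float],
                 step_size: float = 0.1, num_iterations: int = 1) -> None:
        self.weights = list(initial_weights)
        self.step_size = step_size
        self.num_iterations = num_iterations

    def predict(self, features: Sequence[float]) -> float:
        # Linear model without an intercept: dot product of weights and features.
        return sum(w * x for w, x in zip(self.weights, features))

    def train_on_batch(self, batch: Batch) -> None:
        """Run num_iterations gradient steps on one batch, starting from
        the weights left behind by the previous batch."""
        if not batch:
            return  # an empty batch leaves the model unchanged
        for _ in range(self.num_iterations):
            grad = [0.0] * len(self.weights)
            for label, features in batch:
                error = self.predict(features) - label
                for j, x in enumerate(features):
                    grad[j] += error * x
            self.weights = [w - self.step_size * g / len(batch)
                            for w, g in zip(self.weights, grad)]

    def predict_on_batch(self, batch_features: List[List[float]]) -> List[float]:
        # One prediction per feature vector, analogous to a map over the batch.
        return [self.predict(f) for f in batch_features]


# Simulated stream: 50 batches of points drawn from y = 2 * x.
stream = [[(2.0 * x, [x]) for x in (1.0, 2.0, 3.0)] for _ in range(50)]

model = ToyStreamingRegression(initial_weights=[0.0],
                               step_size=0.1, num_iterations=1)
for batch in stream:  # plays the role of foreachRDD over the training stream
    model.train_on_batch(batch)

latest_weights = model.weights              # the "latest trained model"
predictions = model.predict_on_batch([[4.0]])
```

Because the weights persist between calls, the model keeps improving as batches arrive; in this toy stream the single weight approaches 2. A larger step size or more iterations per batch adapts faster but can overshoot on noisy data.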
When we first start training a model, we can set the initial weights to a zero vector or a
random vector, or load the latest model from the result of an offline batch process.
We can also decide to save the model periodically to an external system and use the latest
model state as the starting point (for example, in the case of a restart after a node or
application failure).
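The save-and-restart pattern can be sketched as follows, again in plain Python rather than Spark. The helpers load_initial_weights and save_weights, the JSON layout, and the use of a local temporary file as the "external system" are all assumptions made for illustration. A real deployment would more likely write to a distributed file system or a database.

```python
import json
import os
import tempfile
from typing import List


def load_initial_weights(path: str, num_features: int) -> List[float]:
    """Return the saved weights if a checkpoint exists, else a zero vector."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["weights"]
    return [0.0] * num_features


def save_weights(path: str, weights: List[float]) -> None:
    """Write to a temporary file and rename it, so a crash mid-write
    cannot leave a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"weights": weights}, f)
    os.replace(tmp_path, path)


# Hypothetical checkpoint location standing in for an external system.
checkpoint = os.path.join(tempfile.mkdtemp(), "model.json")

cold_start = load_initial_weights(checkpoint, num_features=3)  # no file yet
save_weights(checkpoint, [0.5, -1.25, 2.0])                    # saved after some batch
warm_start = load_initial_weights(checkpoint, num_features=3)  # after a "restart"
```

On the first run no checkpoint exists, so training starts from zeros. After a restart the same call picks up the last saved weights, and the model resumes from where it left off instead of retraining from scratch.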