Real-time Machine Learning with Spark Streaming - Machine Learning with Spark

Online learning

The batch machine learning methods that we have applied in this book focus on processing an existing fixed set of training data. Typically, these techniques are also iterative, and we have performed multiple passes over our training data in order to converge to an optimal model.

By contrast, online learning is based on performing only one sequential pass through the training data in a fully incremental fashion (that is, one training example at a time). After seeing each training example, the model makes a prediction for this example and then receives the true outcome (for example, the label for classification or the real-valued target for regression). The idea behind online learning is that the model continually updates as new information is received instead of being retrained periodically in batch training.
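
The predict-then-update cycle described above can be sketched in a few lines. This is a minimal illustration only, assuming a linear regression model with squared-error loss; the function names, the step size, and the synthetic noiseless stream are hypothetical and are not taken from the book or from Spark.

```python
# A minimal sketch of purely online learning: one SGD update per example,
# in a single sequential pass over the data.

def predict(weights, bias, features):
    """Return the linear model's prediction for one example."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def online_update(weights, bias, features, target, step_size=0.1):
    """Predict first, then use the revealed target to take one gradient step."""
    prediction = predict(weights, bias, features)
    error = prediction - target  # derivative of 0.5 * error**2 w.r.t. the prediction
    new_weights = [w - step_size * error * x for w, x in zip(weights, features)]
    new_bias = bias - step_size * error
    return new_weights, new_bias, prediction


def example_stream(n):
    """Hypothetical stream: targets follow y = 2 * x + 1 with no noise."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield [x], 2.0 * x + 1.0


weights, bias = [0.0], 0.0
for features, target in example_stream(2000):
    # Each example is seen exactly once and then discarded.
    weights, bias, _ = online_update(weights, bias, features, target)
```

Because every example is used once and then dropped, memory use does not grow with the length of the stream; the cost is that the model's quality depends on the order and number of examples seen so far.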

In some settings, when data volume is very large or the process that generates the data is changing rapidly, online learning methods can adapt more quickly and in near real time, without needing to be retrained in an expensive batch process.

However, online learning methods do not have to be used in a purely online manner. In fact, we have already seen an example of using an online learning model in the batch setting when we used stochastic gradient descent (SGD) optimization to train our classification and regression models. SGD updates the model after each training example. However, we still made multiple passes over the training data in order to converge to a better result.

In the pure online setting, we do not (or perhaps cannot) make multiple passes over the training data; hence, we need to process each input as it arrives. Online methods also include mini-batch methods where, instead of processing one input at a time, we process a small batch of training data.
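
The mini-batch variant can be sketched as follows: the stream is still consumed in a single pass, but the model takes one step per small group of examples, using the average gradient over that group. This is an illustrative sketch under the same assumptions as before (linear regression, squared-error loss); `mini_batches`, `mini_batch_update`, and the synthetic stream are hypothetical names, not part of any library.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Group an iterator of examples into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def mini_batch_update(weights, bias, batch, step_size=0.1):
    """Take one gradient step using the average gradient over the batch."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for features, target in batch:
        error = sum(w * x for w, x in zip(weights, features)) + bias - target
        for j, x in enumerate(features):
            grad_w[j] += error * x
        grad_b += error
    n = len(batch)
    new_weights = [w - step_size * g / n for w, g in zip(weights, grad_w)]
    new_bias = bias - step_size * grad_b / n
    return new_weights, new_bias


# Hypothetical noiseless stream following y = 2 * x + 1.
stream = (([(i % 10) / 10.0], 2.0 * ((i % 10) / 10.0) + 1.0) for i in range(20000))

weights, bias = [0.0], 0.0
num_updates = 0
for batch in mini_batches(stream, batch_size=5):
    weights, bias = mini_batch_update(weights, bias, batch)
    num_updates += 1
```

Averaging over a batch gives a less noisy gradient estimate than a single example, at the price of fewer updates for the same amount of data.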

Online and batch methods can also be combined in real-world situations. For example, we can periodically retrain our models offline (say, every day) using batch methods. We can then deploy the trained model to production and update it using online methods in real time (that is, during the day, in between batch retraining) to adapt to any changes in the environment.
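
One way to picture this combination is a model that is trained offline with several passes over historical data and then refined with single-pass updates as new data arrives. The sketch below is a toy illustration under assumed conditions: a linear model, squared-error loss, and a made-up scenario in which the intercept of the data-generating process drifts from 1.0 to 1.5. All names (`sgd_step`, `batch_train`, `online_refine`) are hypothetical.

```python
def sgd_step(weights, bias, features, target, step_size=0.1):
    """One SGD update for a linear model with squared-error loss."""
    error = sum(w * x for w, x in zip(weights, features)) + bias - target
    new_weights = [w - step_size * error * x for w, x in zip(weights, features)]
    return new_weights, bias - step_size * error


def batch_train(history, num_passes=200):
    """Offline retraining: several full passes over a fixed historical dataset."""
    weights, bias = [0.0], 0.0
    for _ in range(num_passes):
        for features, target in history:
            weights, bias = sgd_step(weights, bias, features, target)
    return weights, bias


def online_refine(weights, bias, stream):
    """Between retrains: one update per arriving example, starting from the deployed model."""
    for features, target in stream:
        weights, bias = sgd_step(weights, bias, features, target)
    return weights, bias


xs = [i / 10.0 for i in range(10)]
history = [([x], 2.0 * x + 1.0) for x in xs]               # assumed historical data
todays_stream = [([x], 2.0 * x + 1.5) for x in xs] * 200   # assumed drift in the intercept

deployed_w, deployed_b = batch_train(history)
live_w, live_b = online_refine(deployed_w, deployed_b, todays_stream)
```

Starting the online updates from the batch-trained parameters means the live model only has to learn the drift, rather than relearning the whole relationship from scratch.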

As we will see in this chapter, the online learning setting can fit neatly into stream processing and the Spark Streaming framework.
