Batch versus real time
In the previous sections, we outlined the common batch processing approach, where the
model is periodically retrained using all of the data or a subset of it. As the preceding
pipeline takes some time to complete, it might not be possible to use this approach to
update models immediately as new data arrives.
While we will be mostly covering batch machine learning approaches in this topic, there is
a class of machine learning algorithms known as online learning; they update immediately
as new data is fed into the model, thus enabling a real-time system. A common example is
an online-optimization algorithm for a linear model, such as stochastic gradient descent.
Such an algorithm can learn from one example at a time. The advantages of these methods
are that the system can react very quickly to new information and also that the system can
adapt to changes in the underlying behavior (that is, if the characteristics and distribution
of the input data are changing over time, which is almost always the case in real-world
situations).
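
To make the idea of an immediate update concrete, here is a minimal sketch of stochastic gradient descent for a linear model with squared-error loss, written in plain Python rather than Spark. The class name OnlineLinearModel, the learning rate, and the simulated stream are all assumptions made for illustration; they are not taken from the text or from any library.

```python
import random


class OnlineLinearModel:
    """Linear regression trained by SGD, one example at a time (illustrative)."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, not tuned

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, label):
        """Apply a single SGD step for squared-error loss on one example."""
        error = self.predict(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
        return error


def simulate_stream(n_examples, seed=0):
    """Hypothetical data source: y = 2*x0 - 3*x1 + 1 plus a little noise."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 2.0 * x[0] - 3.0 * x[1] + 1.0 + rng.gauss(0, 0.01)
        yield x, y


model = OnlineLinearModel(n_features=2)
for features, label in simulate_stream(5000):
    # The model is usable for prediction after every single update.
    model.update(features, label)
```

Because each update touches only one example, the model never needs to revisit old data, which is what allows it to follow a drifting input distribution.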
However, online-learning models come with their own unique challenges in a production
context. For example, it might be difficult to ingest and transform data in real time. It can
also be complex to properly perform model selection in a purely online setting. Latency of
the online training and the model selection and deployment phases might be too high for
true real-time requirements (for example, in online advertising, latency requirements are
measured in single-digit milliseconds). Finally, batch-oriented frameworks might make it
awkward to handle real-time processes of a streaming nature.
Fortunately, Spark's real-time stream processing component, Spark Streaming, is a good
potential fit for real-time machine learning workflows. We will explore Spark Streaming
and online learning in Chapter 10, Real-time Machine Learning with Spark Streaming.
Due to the complexities inherent in a true real-time machine learning system, in practice,
many systems target near real-time operations. This is essentially a hybrid approach where
models are not necessarily updated immediately as new data arrives; instead, the new data
is collected into mini-batches of a small set of training data. These mini-batches can be fed
to an online-learning algorithm. In many cases, this approach is combined with a periodic
batch process that might recompute the model on the entire data set and perform more
complex processing and model selection. This can help ensure that the real-time model
does not degrade over time.
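
The hybrid approach described above can be sketched as follows, again in plain Python rather than Spark. Incoming examples are buffered into mini-batches, each mini-batch triggers one gradient step, and every so often the model is recomputed from scratch over all the data seen so far. The function names, batch size, retraining interval, and learning rates are hypothetical choices for illustration only.

```python
import random


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def minibatch_update(weights, batch, learning_rate=0.1):
    """One gradient step on the mean squared error of a mini-batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = predict(weights, x) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return [w - learning_rate * g / len(batch) for w, g in zip(weights, grad)]


def batch_retrain(history, n_features, epochs=100, learning_rate=0.5):
    """Periodic batch job: refit from scratch with full-batch gradient descent."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        weights = minibatch_update(weights, history, learning_rate)
    return weights


def run_near_real_time(stream, n_features, batch_size=20, retrain_every=25):
    weights = [0.0] * n_features
    history, buffer = [], []
    batches_seen = 0
    for example in stream:
        buffer.append(example)
        if len(buffer) < batch_size:
            continue
        # Near real-time path: update as soon as a mini-batch is full.
        weights = minibatch_update(weights, buffer)
        history.extend(buffer)
        buffer = []
        batches_seen += 1
        # Periodic batch path: recompute the model on the entire data set.
        if batches_seen % retrain_every == 0:
            weights = batch_retrain(history, n_features)
    return weights


def make_stream(n_examples, seed=1):
    """Hypothetical stream: y = 1 + 2*x1 - 3*x2 plus noise; x[0] is a constant 1."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 1.0 + 2.0 * x[1] - 3.0 * x[2] + rng.gauss(0, 0.01)
        yield x, y


weights = run_near_real_time(make_stream(2000), n_features=3)
```

In a real deployment the batch job would typically run separately and also handle model selection; here it is inlined only to keep the sketch short.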
Another similar approach involves making approximate updates to a more complex model
as new data arrives while recomputing the entire model in a batch process periodically. In