Designing a Machine Learning System - Machine Learning with Spark


Batch versus real time

In the previous sections, we outlined the common batch processing approach, where the model is periodically retrained using all data or a subset of it. As the preceding pipeline takes some time to complete, it might not be possible to use this approach to update models immediately as new data arrives.

While we will mostly be covering batch machine learning approaches in this topic, there is a class of machine learning algorithms known as online learning; they update immediately as new data is fed into the model, thus enabling a real-time system. A common example is an online-optimization algorithm for a linear model, such as stochastic gradient descent. Such an algorithm can be trained incrementally, one example (or a small batch of examples) at a time. These methods have two advantages: the system can react very quickly to new information, and it can adapt to changes in the underlying behavior (that is, when the characteristics and distribution of the input data change over time, which is almost always the case in real-world situations).
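
To make the idea of an immediate, per-example update concrete, the following is a minimal sketch of stochastic gradient descent for a linear regression model, written in plain Python rather than with Spark. The class name `OnlineLinearRegression`, the learning rate, and the synthetic data stream are all assumptions chosen for illustration; they do not come from the text or from any library.

```python
import random


class OnlineLinearRegression:
    """Hypothetical linear model trained by SGD, one example at a time."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        # Dot product of the weights and features, plus the bias term.
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def update(self, x, y):
        """Apply one SGD step on the loss 0.5 * (prediction - y)^2."""
        error = self.predict(x) - y
        for i, xi in enumerate(x):
            # Gradient of the loss with respect to weight i is error * x_i.
            self.weights[i] -= self.learning_rate * error * xi
        # Gradient of the loss with respect to the bias is error.
        self.bias -= self.learning_rate * error
        return error


# Simulated stream drawn from y = 2*x1 - 3*x2 + 1 (synthetic, noise-free data).
rng = random.Random(42)
model = OnlineLinearRegression(n_features=2)
for _ in range(5000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 2.0 * x[0] - 3.0 * x[1] + 1.0
    model.update(x, y)  # the model changes as soon as each example arrives
```

Because every call to `update` adjusts the model using only the newest example, the model can follow a drifting data distribution without a full retraining pass, which is the adaptability described above.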

However, online-learning models come with their own unique challenges in a production context. For example, it might be difficult to ingest and transform data in real time. It can also be complex to properly perform model selection in a purely online setting. Latency of the online training and the model selection and deployment phases might be too high for true real-time requirements (for example, in online advertising, latency requirements are measured in single-digit milliseconds). Finally, batch-oriented frameworks might make it awkward to handle real-time processes of a streaming nature.

Fortunately, Spark's real-time stream processing component, Spark Streaming, is a good potential fit for real-time machine learning workflows. We will explore Spark Streaming and online learning in Chapter 10, Real-time Machine Learning with Spark Streaming.

Due to the complexities inherent in a true real-time machine learning system, in practice, many systems target near real-time operations. This is essentially a hybrid approach where models are not necessarily updated immediately as new data arrives; instead, the new data is collected into mini-batches, each a small set of training examples. These mini-batches can be fed to an online-learning algorithm. In many cases, this approach is combined with a periodic batch process that might recompute the model on the entire data set and perform more complex processing and model selection. This can help ensure that the real-time model does not degrade over time.
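
The hybrid approach described above can be sketched as follows: incoming examples are buffered into mini-batches, each full mini-batch drives one gradient update, and every few mini-batches the model is recomputed from all data seen so far. This is an illustrative sketch in plain Python under simplifying assumptions (a single feature, a closed-form least-squares refit as the "batch process"); the class `MiniBatchLearner` and its parameters are hypothetical names, not part of Spark.

```python
class MiniBatchLearner:
    """Hypothetical one-feature model y ~ w*x + b for a near real-time setup."""

    def __init__(self, batch_size=4, retrain_every=3, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.batch_size = batch_size
        self.retrain_every = retrain_every  # measured in mini-batches
        self.learning_rate = learning_rate
        self.buffer = []    # examples waiting for the next mini-batch update
        self.history = []   # all examples already used, kept for batch retraining
        self.minibatch_updates = 0
        self.full_retrains = 0

    def observe(self, x, y):
        """Collect one example; train only when a mini-batch is full."""
        self.buffer.append((x, y))
        if len(self.buffer) >= self.batch_size:
            self._minibatch_update(self.buffer)
            self.history.extend(self.buffer)
            self.buffer = []
            self.minibatch_updates += 1
            if self.minibatch_updates % self.retrain_every == 0:
                self._full_retrain()

    def _minibatch_update(self, batch):
        # One gradient step using the average squared-error gradient of the batch.
        grad_w = sum((self.w * x + self.b - y) * x for x, y in batch) / len(batch)
        grad_b = sum((self.w * x + self.b - y) for x, y in batch) / len(batch)
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b

    def _full_retrain(self):
        # Periodic batch process: ordinary least squares over the whole history.
        n = len(self.history)
        mean_x = sum(x for x, _ in self.history) / n
        mean_y = sum(y for _, y in self.history) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.history)
        if var_x == 0:
            return  # slope is undefined when all x values are identical
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in self.history)
        self.w = cov_xy / var_x
        self.b = mean_y - self.w * mean_x
        self.full_retrains += 1


# Synthetic stream from y = 3*x + 2 (illustrative data only).
learner = MiniBatchLearner()
for i in range(25):
    x = i * 0.1
    learner.observe(x, 3.0 * x + 2.0)
```

With 25 examples and a mini-batch size of 4, six mini-batch updates run, one example remains buffered, and the full recomputation runs twice. The periodic recomputation replaces whatever the incremental updates produced, which is one way to keep the incrementally updated model from drifting.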

Another similar approach involves making approximate updates to a more complex model as new data arrives while recomputing the entire model in a batch process periodically. In
