Chapter 10. Real-time Machine Learning with Spark Streaming
So far in this book, we have focused on batch data processing. That is, all our analysis,
feature extraction, and model training have been applied to a fixed set of data that does not
change. This fits neatly into Spark's core abstraction of RDDs, which are immutable
distributed datasets. Once created, the data underlying the RDD does not change, although we
might create new RDDs from the original RDD through Spark's transformation operators
(actions, by contrast, return results to the driver program rather than producing new RDDs).
Our attention has also been on batch machine learning models, where we train a model on a
fixed batch of training data that is usually represented as an RDD of feature vectors (and
labels, in the case of supervised learning models).
In this chapter, we will:
• Introduce the concept of online learning, where models are trained and updated on
new data as it becomes available
• Explore stream processing using Spark Streaming
• See how Spark Streaming fits together with the online learning approach
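To make the idea of online learning concrete before turning to Spark, the following is a minimal sketch in plain Python, not Spark Streaming or MLlib code. It assumes a simple linear model trained with one stochastic gradient step per incoming example. The names sgd_update, train_online, and example_stream, the step size, and the synthetic data are all hypothetical and chosen purely for illustration.

```python
# Minimal online learning sketch (plain Python, no Spark).
# All names and data here are illustrative assumptions.

def sgd_update(weights, features, label, step_size=0.1):
    """Return new weights after one gradient step on squared error."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - label
    return [w - step_size * error * x for w, x in zip(weights, features)]


def train_online(stream, num_features, step_size=0.1):
    """Update the model once per example as each one arrives."""
    weights = [0.0] * num_features
    for features, label in stream:
        # Only the current example is needed; no fixed batch is stored.
        weights = sgd_update(weights, features, label, step_size)
    return weights


def example_stream(n):
    """Hypothetical noise-free stream generated by y = 2*x1 - 1*x2."""
    for i in range(n):
        x1 = (i % 7) / 7.0
        x2 = (i % 5) / 5.0
        yield [x1, x2], 2.0 * x1 - 1.0 * x2


weights = train_online(example_stream(5000), num_features=2)
print(weights)  # should approach the generating weights [2.0, -1.0]
```

The key contrast with the batch setting is that train_online never holds the full dataset: it consumes one example, updates the weights, and moves on, so the model can keep learning for as long as data keeps arriving.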