An introduction to Spark Streaming
There are a few general techniques for dealing with stream processing. Two of the most common are as follows:
• Treat each record individually and process it as soon as it is seen.
• Combine multiple records into mini-batches. These mini-batches can be delineated either by time or by the number of records in a batch (a simple count-based sketch follows this list).
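As an illustration of the second technique, the following plain-Scala sketch (no Spark involved; the record values and the batch size of three are made up for the example) groups an incoming sequence of records into count-delineated mini-batches:

// A toy, count-delineated mini-batcher: every three records form one batch.
val records = Iterator("r1", "r2", "r3", "r4", "r5", "r6", "r7")
records.grouped(3).foreach { batch =>
  // Each mini-batch is processed as a unit rather than record by record
  println(s"processing mini-batch: ${batch.mkString(", ")}")
}

A time-delineated variant would instead close off a batch whenever a fixed interval elapses, which is the approach Spark Streaming takes.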
Spark Streaming takes the second approach. The core primitive in Spark Streaming is the discretized stream, or DStream. A DStream is a sequence of mini-batches, where each mini-batch is represented as a Spark RDD:
[Figure: The discretized stream abstraction]
A DStream is defined by its input source and a time window called the batch interval. The stream is broken up into time periods equal to the batch interval (beginning from the starting time of the application). Each RDD in the stream contains the records received by the Spark Streaming application during a given batch interval. If no data arrives in a given interval, the RDD will simply be empty.
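To make the batch interval concrete, here is a minimal sketch of setting one up. The local master, application name, and socket address are placeholder assumptions for the example; the core API calls (StreamingContext, Seconds, socketTextStream, foreachRDD) are standard Spark Streaming:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    // The batch interval is fixed when the StreamingContext is created:
    // here, each RDD in the DStream covers 10 seconds of received data.
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchIntervalSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A DStream defined by its input source (a socket) and the batch interval
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each mini-batch surfaces as an ordinary RDD; an interval in which
    // no data arrives simply yields an empty RDD.
    lines.foreachRDD { rdd =>
      println(s"records in this interval: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note the local[2] master: at least two threads are needed locally so that one can run the receiver while another processes the received data.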
Input sources
Spark Streaming receivers are responsible for receiving data from an input source and
converting the raw data into a DStream made up of Spark RDDs.
Spark Streaming supports various input sources, including file-based sources (where the receiver watches for new files arriving at the input location and creates the DStream from the contents read from each new file) and network-based sources (such as receivers that communicate with socket-based sources, the Twitter API stream, Akka actors, or message queues and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis).
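For instance, a file-based and a socket-based source can be created from the same context as follows. The directory path and socket address are placeholders; textFileStream and socketTextStream are both part of the standard StreamingContext API:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("InputSourcesSketch")
val ssc = new StreamingContext(conf, Seconds(5))

// File-based source: Spark watches the input location for new files and
// builds the DStream from the contents of each file as it arrives.
val fileLines = ssc.textFileStream("/tmp/streaming-input")

// Network-based source: a receiver reading text lines from a TCP socket.
val socketLines = ssc.socketTextStream("localhost", 9999)

fileLines.print()    // show a few records from each mini-batch
socketLines.print()

ssc.start()
ssc.awaitTermination()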