Spark Streaming - Learning Spark

Database Reference

In-Depth Information

also give us the time of the current batch, allowing us to output each time period to a

different location. See Example 10-28 .

Example 10-28. Saving data to external systems with foreachRDD() in Scala

ipAddressRequestCount . foreachRDD { rdd =>

rdd . foreachPartition { partition =>

// Open connection to storage system (e.g. a database connection)

partition . foreach { item =>

// Use connection to push item to system

}

// Close connection

}

Input Sources

Spark Streaming has built-in support for a number of different data sources. Some

“core” sources are built into the Spark Streaming Maven artifact, while others are

available through additional artifacts, such as spark-streaming-kafka .

This section walks through some of these sources. It assumes that you already have

each input source set up, and is not intended to introduce the non-Spark-specific

components of any of these systems. If you are designing a new application, we rec‐

ommend trying HDFS or Kafka as simple input sources to get started with.

Core Sources

The methods to create DStream from the core sources are all available on the Stream‐

ingContext. We have already explored one of these sources in the example: sockets.

Here we discuss two more, files and Akka actors.

Stream of files

Since Spark supports reading from any Hadoop-compatible filesystem, Spark Stream‐

ing naturally allows a stream to be created from files written in a directory of a

Hadoop-compatible filesystem. This is a popular option due to its support of a wide

variety of backends, especially for log data that we would copy to HDFS anyway. For

Spark Streaming to work with the data, it needs to have a consistent date format for

the directory names and the files have to be created atomically (e.g., by moving the

file into the directory Spark is monitoring). 2 We can change Examples 10-4 and 10-5

2 Atomically means that the entire operation happens at once. This is important here since if Spark Streaming

were to start processing the file and then more data were to appear, it wouldn't notice the additional data. In

filesystems, the file rename operation is typically atomic.

Search WWH ::

Custom Search

Home