Database Reference
In-Depth Information
also give us the time of the current batch, allowing us to output each time period to a
different location. See Example 10-28 .
Example 10-28. Saving data to external systems with foreachRDD() in Scala
ipAddressRequestCount . foreachRDD { rdd =>
rdd . foreachPartition { partition =>
// Open connection to storage system (e.g. a database connection)
partition . foreach { item =>
// Use connection to push item to system
}
// Close connection
}
}
Input Sources
Spark Streaming has built-in support for a number of different data sources. Some
“core” sources are built into the Spark Streaming Maven artifact, while others are
available through additional artifacts, such as spark-streaming-kafka .
This section walks through some of these sources. It assumes that you already have
each input source set up, and is not intended to introduce the non-Spark-specific
components of any of these systems. If you are designing a new application, we rec‐
ommend trying HDFS or Kafka as simple input sources to get started with.
Core Sources
The methods to create DStream from the core sources are all available on the Stream‐
ingContext. We have already explored one of these sources in the example: sockets.
Here we discuss two more, files and Akka actors.
Stream of files
Since Spark supports reading from any Hadoop-compatible filesystem, Spark Stream‐
ing naturally allows a stream to be created from files written in a directory of a
Hadoop-compatible filesystem. This is a popular option due to its support of a wide
variety of backends, especially for log data that we would copy to HDFS anyway. For
Spark Streaming to work with the data, it needs to have a consistent date format for
the directory names and the files have to be created atomically (e.g., by moving the
file into the directory Spark is monitoring). 2 We can change Examples 10-4 and 10-5
2 Atomically means that the entire operation happens at once. This is important here since if Spark Streaming
were to start processing the file and then more data were to appear, it wouldn't notice the additional data. In
filesystems, the file rename operation is typically atomic.
 
Search WWH ::




Custom Search