worker node running the receiver. Furthermore, if the worker running the receiver
fails, the system will try to launch the receiver at a different location, and Flume will
need to be reconfigured to send to the new worker. This is often challenging to set
up.
Pull-based receiver
The newer pull-based approach (added in Spark 1.1) is to set up a specialized Flume
sink that Spark Streaming reads from, with the receiver pulling the data from the
sink. This approach is preferred for resiliency: the data remains in the sink until
Spark Streaming has read and replicated it and acknowledged this to the sink via a
transaction.
To get started, we will need to set up the custom sink as a third-party plug-in for
Flume. The latest directions on installing plug-ins are in the Flume documentation.
Since the plug-in is written in Scala, we need to add both the plug-in and the Scala
library to Flume's plug-in directory. The Maven coordinates for both are shown in
Example 10-37.
Example 10-37. Maven coordinates for Flume sink
groupId = org.apache.spark
artifactId = spark-streaming-flume-sink_2.10
version = 1.2.0
groupId = org.scala-lang
artifactId = scala-library
version = 2.10.4
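The coordinates in Example 10-37 can also be resolved with sbt before you copy the
jars into Flume's plug-in directory. The following build.sbt fragment is a minimal
sketch of the equivalent declaration; the actual download and copy steps are left to
your build setup:
// Equivalent sbt dependency declaration for the sink plug-in and the Scala library
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming-flume-sink_2.10" % "1.2.0",
  "org.scala-lang"   % "scala-library"                   % "2.10.4"
)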
Once the custom Flume sink has been added to a node, we need to configure Flume
to push to the sink, as we do in Example 10-38.
Example 10-38. Flume configuration for custom sink
a1.sinks = spark
a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.spark.hostname = receiver-hostname
a1.sinks.spark.port = port-used-for-sync-not-spark-port
a1.sinks.spark.channel = memoryChannel
With the data buffered in the sink, we can now use FlumeUtils to read it, as
shown in Examples 10-39 and 10-40.
Example 10-39. FlumeUtils custom sink in Scala
val events = FlumeUtils.createPollingStream(ssc, receiverHostname, receiverPort)
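The stream returned by createPollingStream is composed of SparkFlumeEvent records
that wrap the underlying Avro Flume event. As a minimal sketch, assuming the event
bodies are UTF-8 encoded log lines, we can pull them out like this:
// Each SparkFlumeEvent exposes the Avro event; getBody returns a ByteBuffer.
// We assume here that the bodies are UTF-8 text lines.
val lines = events.map { e => new String(e.event.getBody.array(), "UTF-8") }
lines.print()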