worker node running the receiver. Furthermore, if the worker running the receiver
fails, the system will try to launch the receiver at a different location, and Flume will
need to be reconfigured to send to the new worker. This is often challenging to set
up.
Pull-based receiver
The newer pull-based approach (added in Spark 1.1) is to set up a specialized Flume
sink that Spark Streaming reads from, with the receiver pulling the data from the
sink. This approach is preferred for resiliency: the data remains in the sink until
Spark Streaming has read and replicated it and acknowledged this to the sink via a
transaction.
To get started, we will need to set up the custom sink as a third-party plug-in for
Flume. The latest directions on installing plug-ins are in the Flume documentation.
Since the plug-in is written in Scala, we need to add both the plug-in and the Scala
library to Flume's plug-in directory. The Maven coordinates for both are shown in
Example 10-37.
Example 10-37. Maven coordinates for Flume sink
groupId = org.apache.spark
artifactId = spark-streaming-flume-sink_2.10
version = 1.2.0
groupId = org.scala-lang
artifactId = scala-library
version = 2.10.4
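The coordinates in Example 10-37 can also be resolved with sbt before you copy the
jars into Flume's plug-in directory. The following build.sbt fragment is a minimal
sketch of the equivalent declaration; the actual download and copy steps are left to
your build setup:
// Equivalent sbt dependency declaration for the sink plug-in and the Scala library
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming-flume-sink_2.10" % "1.2.0",
  "org.scala-lang"   % "scala-library"                   % "2.10.4"
)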
Once the custom Flume sink has been added to a node, we need to configure Flume
to push to the sink, as we do in Example 10-38.
Example 10-38. Flume configuration for custom sink
a1.sinks = spark
a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.spark.hostname = receiver-hostname
a1.sinks.spark.port = port-used-for-sync-not-spark-port
a1.sinks.spark.channel = memoryChannel
With the data buffered in the sink, we can now use FlumeUtils to read it, as
shown in Examples 10-39 and 10-40.
Example 10-39. FlumeUtils custom sink in Scala
val events = FlumeUtils.createPollingStream(ssc, receiverHostname, receiverPort)
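The stream returned by createPollingStream is composed of SparkFlumeEvent records
that wrap the underlying Avro Flume event. As a minimal sketch, assuming the event
bodies are UTF-8 encoded log lines, we can pull them out like this:
// Each SparkFlumeEvent exposes the Avro event; getBody returns a ByteBuffer.
// We assume here that the bodies are UTF-8 text lines.
val lines = events.map { e => new String(e.event.getBody.array(), "UTF-8") }
lines.print()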