Database Reference
In-Depth Information
Example 10-40. FlumeUtils custom sink in Java
JavaDStream < SparkFlumeEvent > events = FlumeUtils . createPollingStream ( ssc ,
receiverHostname , receiverPort )
In either case, the DStream is composed of SparkFlumeEvents . We can access the
underlying AvroFlumeEvent through event . If our event body was UTF-8 strings we
could get the contents as shown in Example 10-41 .
Example 10-41. SparkFlumeEvent in Scala
// Assuming that our flume events are UTF-8 log lines
val lines = events . map { e => new String ( e . event . getBody (). array (), "UTF-8" )}
Custom input sources
In addition to the provided sources, you can also implement your own receiver. This
is described in Spark's documentation in the Streaming Custom Receivers guide .
Multiple Sources and Cluster Sizing
As covered earlier, we can combine multiple DStreams using operations like union() .
Through these operators, we can combine data from multiple input DStreams. Some‐
times multiple receivers are necessary to increase the aggregate throughput of the
ingestion (if a single receiver becomes the bottleneck). Other times different receivers
are created on different sources to receive different kinds of data, which are then
combined using joins or cogroups.
It is important to understand how the receivers are executed in the Spark cluster to
use multiple ones. Each receiver runs as a long-running task within Spark's executors,
and hence occupies CPU cores allocated to the application. In addition, there need to
be available cores for processing the data. This means that in order to run multiple
receivers, you should have at least as many cores as the number of receivers, plus
however many are needed to run your computation. For example, if we want to run
10 receivers in our streaming application, then we have to allocate at least 11 cores.
Do not run Spark Streaming programs locally with master config‐
ured as "local" or "local[1]" . This allocates only one CPU for
tasks and if a receiver is running on it, there is no resource left to
process the received data. Use at least "local[2]" to have more
cores.
Search WWH ::




Custom Search