In a similar way, for windowed operations, the interval at which you compute a result
(i.e., the slide interval) has a big impact on performance. Consider increasing this
interval for expensive computations if it is a bottleneck.
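As a sketch of this idea (the stream name and durations below are illustrative, not from the text), widening the slide interval of a windowed aggregation makes the expensive reduction run less often:

```scala
import org.apache.spark.streaming.Seconds

// Assumes `pairs` is an existing DStream[(String, Int)] built earlier
// in the application; names and durations here are illustrative.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // combine counts within the window
  Seconds(30),               // window duration
  Seconds(10))               // slide interval: compute every 10 seconds,
                             // not on every (possibly much shorter) batch
```

With a 30-second window and a 10-second slide, each record is still reflected in results, but the reduction is computed a third as often as it would be with a 10-second window sliding every batch.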
Level of Parallelism
A common way to reduce the processing time of batches is to increase the parallelism. There are three ways to increase the parallelism:
Increasing the number of receivers
Receivers can sometimes act as a bottleneck if there are too many records for a
single machine to read in and distribute. You can add more receivers by creating
multiple input DStreams (which creates multiple receivers), and then applying
union to merge them into a single stream.
Explicitly repartitioning received data
If you cannot add more receivers, you can further redistribute the received data by explicitly repartitioning the input stream (or the union of multiple streams) using DStream.repartition().
Increasing parallelism in aggregation
For operations like reduceByKey(), you can specify the parallelism as a second parameter, as already discussed for RDDs.
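The three techniques can be sketched together as follows. This is illustrative only: the host, port, and partition counts are assumptions, and `ssc` stands for an already-created StreamingContext.

```scala
// (1) Multiple receivers: create several input DStreams, each backed by
// its own receiver, then union them into a single stream.
val numReceivers = 4
val streams = (1 to numReceivers).map { _ =>
  ssc.socketTextStream("loghost", 9999) // hypothetical source
}
val unified = ssc.union(streams)

// (2) Explicit repartitioning: spread the received records across more
// partitions so downstream tasks run in parallel.
val repartitioned = unified.repartition(16)

// (3) Parallelism in aggregation: pass the task count as the second
// parameter to reduceByKey().
val counts = repartitioned
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b, 16) // 16 reduce tasks
```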
Garbage Collection and Memory Usage
Another aspect that can cause problems is Java's garbage collection. You can minimize unpredictably large pauses due to GC by enabling Java's Concurrent Mark-Sweep garbage collector. This collector consumes more resources overall, but introduces fewer pauses.
You can control the GC by adding -XX:+UseConcMarkSweepGC to the spark.executor.extraJavaOptions configuration parameter. Example 10-46 shows this with spark-submit.
Example 10-46. Enable the Concurrent Mark-Sweep GC
spark-submit --conf spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC App.jar
In addition to using a garbage collector less likely to introduce pauses, you can make a big difference by reducing GC pressure. Caching RDDs in serialized form (instead of as native objects) also reduces GC pressure, which is why, by default, RDDs generated by Spark Streaming are stored in serialized form. Using Kryo serialization further reduces the memory required for the in-memory representation of cached data.
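As a sketch, Kryo can be enabled through the same --conf mechanism used above; the class name is Spark's built-in Kryo serializer, and App.jar stands in for your application:

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  App.jar
```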