This output nicely illustrates the micro-batch architecture of Spark Streaming. We can see the filtered logs being printed every second, since we set the batch interval to 1 second when we created the StreamingContext. The Spark UI also shows that Spark Streaming is running many small jobs, as you can see in Figure 10-4.
Figure 10-4. Spark application UI when running a streaming job
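For reference, here is a minimal sketch of the kind of program that produces this output. The socket source on localhost port 7777 and the "error" filter are assumptions standing in for the example's actual input; the point is the 1-second batch interval passed to the StreamingContext:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingLogFilter {
      def main(args: Array[String]): Unit = {
        // Batch interval of 1 second: Spark Streaming groups incoming data
        // into RDDs covering 1-second windows, one micro-batch per interval
        val conf = new SparkConf().setAppName("StreamingLogFilter")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: a socket feeding log lines on port 7777
        val lines = ssc.socketTextStream("localhost", 7777)

        // Keep only the lines that look like errors
        val errorLines = lines.filter(_.contains("error"))

        // print() is an output operation; it runs once per batch interval
        errorLines.print()

        ssc.start()             // start receiving and processing data
        ssc.awaitTermination()  // wait for the streaming computation to stop
      }
    }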
Apart from transformations, DStreams support output operations, such as the print() used in our example. Output operations are similar to RDD actions in that they write data to an external system, but in Spark Streaming they run periodically on each time step, producing output in batches.
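As a quick illustration of this periodic behavior, the sketch below contrasts print() with two other standard DStream output operations, assuming errorLines is the filtered DStream from the earlier sketch:

    // Print the first 10 elements of each batch on the driver
    errorLines.print()

    // Write each batch's RDD out as text files; one output directory
    // is created per batch, named with the batch's timestamp
    errorLines.saveAsTextFiles("logs", "txt")

    // foreachRDD gives full control: run arbitrary code on each batch's RDD
    errorLines.foreachRDD { rdd =>
      println(s"Errors in this batch: ${rdd.count()}")
    }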
The execution of Spark Streaming within Spark's driver-worker components is shown in Figure 10-5 (see Figure 2-3 earlier in the book for the components of Spark). For each input source, Spark Streaming launches receivers, which are tasks running within the application's executors that collect data from the input source and save it as RDDs. These receivers receive the input data and replicate it (by default) to another executor for fault tolerance. This data is stored in the memory of the executors in the same way as cached RDDs.1 The StreamingContext in the driver program then periodically runs Spark jobs to process this data and combine it with RDDs from previous time steps.
1 In Spark 1.2, receivers can also replicate data to HDFS. Also, some input sources, such as HDFS, are naturally
replicated, so Spark Streaming does not replicate those again.
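The replication behavior described above is controlled by the storage level you pass when creating a receiver-based input stream. A minimal sketch, assuming ssc is the StreamingContext from the first example and the host and port are placeholders:

    import org.apache.spark.storage.StorageLevel

    // The _2 suffix means "replicate received blocks to a second executor",
    // which is the default for receiver-based sources and gives the fault
    // tolerance described above
    val replicated = ssc.socketTextStream(
      "localhost", 7777,
      StorageLevel.MEMORY_AND_DISK_SER_2)

    // A non-replicated level trades fault tolerance for lower memory use
    val unreplicated = ssc.socketTextStream(
      "localhost", 7777,
      StorageLevel.MEMORY_ONLY)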