This output nicely illustrates the micro-batch architecture of Spark Streaming. We can see the filtered logs being printed every second, since we set the batch interval to 1 second when we created the StreamingContext. The Spark UI also shows that Spark Streaming is running many small jobs, as you can see in Figure 10-4.
Figure 10-4. Spark application UI when running a streaming job
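For reference, here is a minimal sketch of the kind of program that produces this output. The socket source on localhost port 7777 and the "error" filter are assumptions standing in for the example's actual input; the point is the 1-second batch interval passed to the StreamingContext:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingLogFilter {
      def main(args: Array[String]): Unit = {
        // Batch interval of 1 second: Spark Streaming groups incoming data
        // into RDDs covering 1-second windows, one micro-batch per interval
        val conf = new SparkConf().setAppName("StreamingLogFilter")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: a socket feeding log lines on port 7777
        val lines = ssc.socketTextStream("localhost", 7777)

        // Keep only the lines that look like errors
        val errorLines = lines.filter(_.contains("error"))

        // print() is an output operation; it runs once per batch interval
        errorLines.print()

        ssc.start()             // start receiving and processing data
        ssc.awaitTermination()  // wait for the streaming computation to stop
      }
    }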
Apart from transformations, DStreams support output operations, such as the print() used in our example. Output operations are similar to RDD actions in that they write data to an external system, but in Spark Streaming they run periodically on each time step, producing output in batches.
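As a quick illustration of this periodic behavior, the sketch below contrasts print() with two other standard DStream output operations, assuming errorLines is the filtered DStream from the earlier sketch:

    // Print the first 10 elements of each batch on the driver
    errorLines.print()

    // Write each batch's RDD out as text files; one output directory
    // is created per batch, named with the batch's timestamp
    errorLines.saveAsTextFiles("logs", "txt")

    // foreachRDD gives full control: run arbitrary code on each batch's RDD
    errorLines.foreachRDD { rdd =>
      println(s"Errors in this batch: ${rdd.count()}")
    }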
The execution of Spark Streaming within Spark's driver-worker components is shown in Figure 10-5 (see Figure 2-3 earlier in the book for the components of Spark). For each input source, Spark Streaming launches receivers, which are tasks running within the application's executors that collect data from the input source and save it as RDDs. These receivers receive the input data and replicate it (by default) to another executor for fault tolerance. This data is stored in the memory of the executors in the same way as cached RDDs.1 The StreamingContext in the driver program then periodically runs Spark jobs to process this data and combine it with RDDs from previous time steps.
1 In Spark 1.2, receivers can also replicate data to HDFS. Also, some input sources, such as HDFS, are naturally
replicated, so Spark Streaming does not replicate those again.
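The replication behavior described above is controlled by the storage level you pass when creating a receiver-based input stream. A minimal sketch, assuming ssc is the StreamingContext from the first example and the host and port are placeholders:

    import org.apache.spark.storage.StorageLevel

    // The _2 suffix means "replicate received blocks to a second executor",
    // which is the default for receiver-based sources and gives the fault
    // tolerance described above
    val replicated = ssc.socketTextStream(
      "localhost", 7777,
      StorageLevel.MEMORY_AND_DISK_SER_2)

    // A non-replicated level trades fault tolerance for lower memory use
    val unreplicated = ssc.socketTextStream(
      "localhost", 7777,
      StorageLevel.MEMORY_ONLY)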