Figure 10-5. Execution of Spark Streaming within Spark's components
Spark Streaming offers the same fault-tolerance properties for DStreams as Spark has for RDDs: as long as a copy of the input data is still available, it can recompute any state derived from it using the lineage of the RDDs (i.e., by rerunning the operations used to process it). By default, received data is replicated across two nodes, as mentioned, so Spark Streaming can tolerate single worker failures. Using just lineage, however, recomputation could take a long time for data that has been built up since the beginning of the program. Thus, Spark Streaming also includes a mechanism called checkpointing that saves state periodically to a reliable filesystem (e.g., HDFS or S3). Typically, you might set up checkpointing every 5-10 batches of data. When recovering lost data, Spark Streaming needs only to go back to the last checkpoint.
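The tradeoff between lineage-only recovery and checkpointing can be sketched in plain Python. This is an illustrative simulation, not the Spark API: the batches, the `process()` function, and the checkpoint bookkeeping are all invented here to show why replaying from the last checkpoint is cheaper than replaying the whole stream.

```python
# Illustrative simulation of lineage recovery vs. checkpointing.
# Plain Python, not Spark; all names here are invented for the sketch.

def process(state, batch):
    # Stand-in for a stateful streaming computation: a running sum.
    return state + sum(batch)

batches = [[1, 2], [3], [4, 5], [6], [7, 8], [9], [10]]
CHECKPOINT_INTERVAL = 5  # checkpoint every 5 batches, as the text suggests

state = 0
checkpoint = (0, 0)  # (number of batches covered, saved state)
for i, batch in enumerate(batches):
    state = process(state, batch)
    if (i + 1) % CHECKPOINT_INTERVAL == 0:
        checkpoint = (i + 1, state)  # persist state to reliable storage

# Recovery using lineage alone: replay every batch from the beginning.
recovered_from_lineage = 0
for batch in batches:
    recovered_from_lineage = process(recovered_from_lineage, batch)

# Recovery from the last checkpoint: replay only the batches after it.
idx, recovered_from_checkpoint = checkpoint
for batch in batches[idx:]:
    recovered_from_checkpoint = process(recovered_from_checkpoint, batch)

assert recovered_from_lineage == recovered_from_checkpoint == state
print(state, "- replayed", len(batches) - idx, "of", len(batches), "batches")
```

Both paths reconstruct the same state, but the checkpointed recovery replays only the two batches received since the last checkpoint instead of all seven.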
In the rest of this chapter, we will explore the transformations, output operations,
and input sources in Spark Streaming in detail. We will then return to fault tolerance
and checkpointing to explain how to configure programs for 24/7 operation.
Transformations
Transformations on DStreams can be grouped into either stateless or stateful:

• In stateless transformations, the processing of each batch does not depend on the data of its previous batches. They include the common RDD transformations we have seen in Chapters 3 and 4, like map(), filter(), and reduceByKey().

• Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.
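The two categories above can be contrasted with a small plain-Python sketch. This is not the Spark API; the batches and counting logic are invented here, with per-batch counts standing in for stateless processing and a running counter standing in for state tracked across time.

```python
# Illustrative contrast between stateless and stateful processing
# over a stream of batches. Plain Python, not the Spark API.
from collections import Counter

batches = [["spark", "streaming"], ["spark"], ["streaming", "spark"]]

# Stateless: each batch is processed entirely on its own, the way
# map()/filter()/reduceByKey() apply independently to each batch.
per_batch_counts = [Counter(batch) for batch in batches]

# Stateful: intermediate results carry over from previous batches,
# in the spirit of tracking state across time (a running word count).
running = Counter()
running_counts = []
for batch in batches:
    running.update(batch)
    running_counts.append(dict(running))

print(per_batch_counts[-1])  # counts for the final batch only
print(running_counts[-1])    # counts accumulated over every batch so far
```

After the final batch, the stateless result reflects only that batch's two words, while the stateful result has accumulated counts from all three batches.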