Figure 10-5. Execution of Spark Streaming within Spark's components
Spark Streaming offers the same fault-tolerance properties for DStreams as Spark has for RDDs: as long as a copy of the input data is still available, it can recompute any state derived from it using the lineage of the RDDs (i.e., by rerunning the operations used to process it). By default, received data is replicated across two nodes, as mentioned, so Spark Streaming can tolerate single worker failures. Using just lineage, however, recomputation could take a long time for data that has been built up since the beginning of the program. Thus, Spark Streaming also includes a mechanism called checkpointing that saves state periodically to a reliable filesystem (e.g., HDFS or S3). Typically, you might set up checkpointing every 5-10 batches of data. When recovering lost data, Spark Streaming needs only to go back to the last checkpoint.
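The tradeoff between lineage-only recovery and checkpointing can be sketched in plain Python. This is an illustrative simulation, not the Spark API: the batches, the `process()` function, and the checkpoint bookkeeping are all invented here to show why replaying from the last checkpoint is cheaper than replaying the whole stream.

```python
# Illustrative simulation of lineage recovery vs. checkpointing.
# Plain Python, not Spark; all names here are invented for the sketch.

def process(state, batch):
    # Stand-in for a stateful streaming computation: a running sum.
    return state + sum(batch)

batches = [[1, 2], [3], [4, 5], [6], [7, 8], [9], [10]]
CHECKPOINT_INTERVAL = 5  # checkpoint every 5 batches, as the text suggests

state = 0
checkpoint = (0, 0)  # (number of batches covered, saved state)
for i, batch in enumerate(batches):
    state = process(state, batch)
    if (i + 1) % CHECKPOINT_INTERVAL == 0:
        checkpoint = (i + 1, state)  # persist state to reliable storage

# Recovery using lineage alone: replay every batch from the beginning.
recovered_from_lineage = 0
for batch in batches:
    recovered_from_lineage = process(recovered_from_lineage, batch)

# Recovery from the last checkpoint: replay only the batches after it.
idx, recovered_from_checkpoint = checkpoint
for batch in batches[idx:]:
    recovered_from_checkpoint = process(recovered_from_checkpoint, batch)

assert recovered_from_lineage == recovered_from_checkpoint == state
print(state, "- replayed", len(batches) - idx, "of", len(batches), "batches")
```

Both paths reconstruct the same state, but the checkpointed recovery replays only the two batches received since the last checkpoint instead of all seven.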
In the rest of this chapter, we will explore the transformations, output operations,
and input sources in Spark Streaming in detail. We will then return to fault tolerance
and checkpointing to explain how to configure programs for 24/7 operation.
Transformations
Transformations on DStreams can be grouped into either stateless or stateful:

• In stateless transformations, the processing of each batch does not depend on the data of its previous batches. They include the common RDD transformations we have seen in Chapters 3 and 4, like map(), filter(), and reduceByKey().

• Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.
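The two categories above can be contrasted with a small plain-Python sketch. This is not the Spark API; the batches and counting logic are invented here, with per-batch counts standing in for stateless processing and a running counter standing in for state tracked across time.

```python
# Illustrative contrast between stateless and stateful processing
# over a stream of batches. Plain Python, not the Spark API.
from collections import Counter

batches = [["spark", "streaming"], ["spark"], ["streaming", "spark"]]

# Stateless: each batch is processed entirely on its own, the way
# map()/filter()/reduceByKey() apply independently to each batch.
per_batch_counts = [Counter(batch) for batch in batches]

# Stateful: intermediate results carry over from previous batches,
# in the spirit of tracking state across time (a running word count).
running = Counter()
running_counts = []
for batch in batches:
    running.update(batch)
    running_counts.append(dict(running))

print(per_batch_counts[-1])  # counts for the final batch only
print(running_counts[-1])    # counts accumulated over every batch so far
```

After the final batch, the stateless result reflects only that batch's two words, while the stateful result has accumulated counts from all three batches.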