Database Reference
In-Depth Information
24/7 Operation
One of the main advantages of Spark Streaming is that it provides strong fault toler‐
ance guarantees. As long as the input data is stored reliably, Spark Streaming will
always compute the correct result from it, offering “exactly once” semantics (i.e., as if
all of the data was processed without any nodes failing), even if workers or the driver
fail.
To run Spark Streaming applications 24/7, you need some special setup. The first step
is setting up checkpointing to a reliable storage system, such as HDFS or Amazon S3. 3
In addition, we need to worry about the fault tolerance of the driver program (which
requires special setup code) and of unreliable input sources. This section covers how
to perform this setup.
Checkpointing
Checkpointing is the main mechanism that needs to be set up for fault tolerance in
Spark Streaming. It allows Spark Streaming to periodically save data about the appli‐
cation to a reliable storage system, such as HDFS or Amazon S3, for use in recover‐
ing. Specifically, checkpointing serves two purposes:
• Limiting the state that must be recomputed on failure. As discussed in “Architec‐
ture and Abstraction” on page 186 , Spark Streaming can recompute state using
the lineage graph of transformations, but checkpointing controls how far back it
must go.
• Providing fault tolerance for the driver. If the driver program in a streaming
application crashes, you can launch it again and tell it to recover from a check‐
point, in which case Spark Streaming will read how far the previous run of the
program got in processing the data and take over from there.
For these reasons, checkpointing is important to set up in any production streaming
application. You can set it by passing a path (either HDFS, S3, or local filesystem) to
the ssc.checkpoint() method, as shown in Example 10-42 .
Example 10-42. Setting up checkpointing
ssc . checkpoint ( "hdfs://..." )
Note that even in local mode, Spark Streaming will complain if you try to run a state‐
ful operation without checkpointing enabled. In that case, you can pass a local
3 We do not cover how to set up one of these filesystems, but they come in many Hadoop or cloud environ‐
ments. When deploying on your own cluster, it is probably easiest to set up HDFS.
 
Search WWH ::




Custom Search