your environment. One place where Spark provides more support is the Standalone
cluster manager, which supports a --supervise flag when submitting your driver
that lets Spark restart it. You will also need to pass --deploy-mode cluster to make
the driver run within the cluster and not on your local machine, as shown in
Example 10-45.
Example 10-45. Launching a driver in supervise mode
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
When using this option, you will also want the Spark Standalone master to be
fault-tolerant. You can configure this using ZooKeeper, as described in the Spark
documentation. With this setup, your application will have no single point of failure.
Finally, note that when the driver crashes, Spark's executors are restarted as well.
This may change in future Spark versions, but it is the expected behavior in 1.2 and
earlier, as the executors cannot continue processing data without a driver. Your
relaunched driver will start new executors to pick up where it left off.
Worker Fault Tolerance
For failure of a worker node, Spark Streaming uses the same techniques as Spark for
its fault tolerance. All the data received from external sources is replicated among the
Spark workers. All RDDs created through transformations of this replicated input
data are tolerant to failure of a worker node, as the RDD lineage allows the system to
recompute the lost data all the way from the surviving replica of the input data.
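This replication is visible in the API through the storage level used when creating a receiver-based stream. The following minimal sketch assumes an existing SparkContext `sc`; the host, port, and batch interval are illustrative:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical one-second batches on an existing SparkContext `sc`.
val ssc = new StreamingContext(sc, Seconds(1))

// A "_2" storage level (the default for most receivers) keeps two replicas
// of each received block on different workers, so a surviving copy remains
// after a worker failure and lineage can recompute any derived RDDs from it.
val lines = ssc.socketTextStream("localhost", 7777,
  StorageLevel.MEMORY_AND_DISK_SER_2)
```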
Receiver Fault Tolerance
The fault tolerance of the workers running the receivers is another important
consideration. If such a failure occurs, Spark Streaming restarts the failed
receivers on other nodes in the cluster. However, whether any received data is lost
depends on the nature of the source (whether the source can resend data or not) and
the implementation of the receiver (whether it updates the source about received
data or not). For example, with Flume, one of the main differences between the two
receivers is their data loss guarantees. With the receiver-pull-from-sink model,
Spark removes the elements only once they have been replicated inside Spark. With
the push-to-receiver model, some data can be lost if the receiver fails before the
data is replicated. In general, for any receiver, you must also consider the
fault-tolerance properties of the upstream source (transactional or not) to ensure
zero data loss.
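The two Flume receivers described above can be sketched as follows; this assumes the spark-streaming-flume artifact is on the classpath and an existing StreamingContext `ssc`, and the hostnames and port are placeholders:

```scala
import org.apache.spark.streaming.flume.FlumeUtils

// Push model: Flume's Avro sink pushes events to this receiver. Events
// buffered on the receiver but not yet replicated inside Spark can be
// lost if the receiver fails.
val pushedEvents = FlumeUtils.createStream(ssc, "receiver-host", 7788)

// Pull model: the receiver polls a custom Spark sink running inside the
// Flume agent, which removes events only after Spark has replicated
// them, giving stronger data loss guarantees.
val pulledEvents = FlumeUtils.createPollingStream(ssc, "sink-host", 7788)
```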
In general, receivers provide the following guarantees:
• All data read from a reliable filesystem (e.g., with StreamingContext.hadoopFiles)
is reliable, because the underlying filesystem is replicated. Spark