Streaming will remember which data it processed in its checkpoints and will pick
up again where it left off if your application crashes.
• For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task is down. In Spark 1.1 and earlier, received data was replicated only in executors' memory, so it could also be lost if the driver crashed (in which case all executors disconnect). In Spark 1.2, received data can be logged to a reliable filesystem like HDFS so that it is not lost on driver restart.
To summarize, the best way to ensure all data is processed is to use a reliable input source (e.g., HDFS or pull-based Flume). This is generally also a best practice if you need to process the data later in batch jobs: it ensures that your batch jobs and streaming jobs will see the same data and produce the same results.
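The checkpoint-and-replay idea behind reliable sources can be sketched in plain Python. This is a conceptual illustration, not Spark's actual checkpointing code; the `CHECKPOINT` path and the doubling step are hypothetical stand-ins for real state and real processing:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical checkpoint file

def load_offset():
    # After a crash and restart, resume from the last recorded position.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["offset"]
    return 0

def process(records, out):
    # A replayable source plus a recorded offset means no record is
    # silently skipped: we always pick up where we left off.
    for i in range(load_offset(), len(records)):
        out.append(records[i] * 2)            # stand-in for real processing
        with open(CHECKPOINT, "w") as f:      # record progress durably
            json.dump({"offset": i + 1}, f)
```

A restart simply calls `process` again: records already recorded in the checkpoint are skipped, and processing continues from the saved offset.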
Processing Guarantees
Due to Spark Streaming's worker fault-tolerance guarantees, it can provide exactly-once semantics for all transformations—even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.
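Reprocessing is safe for transformations because they are deterministic functions of replayable input: recomputing a lost partition reproduces the identical result. A minimal plain-Python illustration of this idea (the `transform` function here is hypothetical, not a Spark API):

```python
def transform(partition):
    # A deterministic transformation, analogous to a map over an RDD partition.
    return [x * 2 for x in partition]

partitions = [[1, 2], [3, 4]]                 # replayable input partitions

normal_run = [transform(p) for p in partitions]

# Simulate a worker failure: partition 1's result is lost and recomputed
# from the original input.
recomputed = transform(partitions[1])
retry_run = [transform(partitions[0]), recomputed]

assert normal_run == retry_run  # same final result despite reprocessing
```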
However, when the transformed result is to be pushed to external systems using output operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times. Since this involves external systems, it is up to the system-specific code to handle this case. We can either use transactions to push to external systems (that is, atomically push one RDD partition at a time), or design updates to be idempotent operations (such that multiple runs of an update still produce the same result). For example, Spark Streaming's saveAs...File operations automatically make sure only one copy of each output file exists, by atomically moving a file to its final destination when it is complete.
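The atomic-rename trick used by the saveAs...File operations can be sketched in plain Python: write to a temporary file, then move it into place in a single step, so a retried task either finds the complete output already present or overwrites it with an identical copy. The function name below is hypothetical, not Spark's code:

```python
import os
import tempfile

def save_atomically(path, lines):
    # Write the output to a temporary file in the same directory...
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # ...then atomically move it to its final destination. Rerunning the
    # task after a failure is idempotent: at most one complete copy of
    # the file ever exists, and readers never see a partial file.
    os.replace(tmp, path)
```

Calling `save_atomically` twice with the same data leaves exactly the same single output file, which is what makes the at-least-once task execution harmless here.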
Streaming UI
Spark Streaming provides a special UI page that lets us look at what applications are doing. This is available in a Streaming tab on the normal Spark UI (typically http://<driver>:4040). A sample screenshot is shown in Figure 10-9.