Streaming will remember which data it processed in its checkpoints and will pick
up again where it left off if your application crashes.
• For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task is down. In Spark 1.1 and earlier, received data was replicated only in executors' memory, so it could also be lost if the driver crashed (in which case all executors disconnect). In Spark 1.2, received data can be logged to a reliable filesystem like HDFS so that it is not lost on driver restart.
To summarize, the best way to ensure all data is processed is to use a reliable input source (e.g., HDFS or pull-based Flume). This is generally also a best practice if you need to process the data later in batch jobs: it ensures that your batch jobs and streaming jobs will see the same data and produce the same results.
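The checkpoint-and-replay idea behind reliable sources can be sketched in plain Python. This is a conceptual illustration, not Spark's actual checkpointing code; the `CHECKPOINT` path and the doubling step are hypothetical stand-ins for real state and real processing:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical checkpoint file

def load_offset():
    # After a crash and restart, resume from the last recorded position.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["offset"]
    return 0

def process(records, out):
    # A replayable source plus a recorded offset means no record is
    # silently skipped: we always pick up where we left off.
    for i in range(load_offset(), len(records)):
        out.append(records[i] * 2)            # stand-in for real processing
        with open(CHECKPOINT, "w") as f:      # record progress durably
            json.dump({"offset": i + 1}, f)
```

A restart simply calls `process` again: records already recorded in the checkpoint are skipped, and processing continues from the saved offset.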
Processing Guarantees
Due to Spark Streaming's worker fault-tolerance guarantees, it can provide exactly-once semantics for all transformations—even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.
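Reprocessing is safe for transformations because they are deterministic functions of replayable input: recomputing a lost partition reproduces the identical result. A minimal plain-Python illustration of this idea (the `transform` function here is hypothetical, not a Spark API):

```python
def transform(partition):
    # A deterministic transformation, analogous to a map over an RDD partition.
    return [x * 2 for x in partition]

partitions = [[1, 2], [3, 4]]                 # replayable input partitions

normal_run = [transform(p) for p in partitions]

# Simulate a worker failure: partition 1's result is lost and recomputed
# from the original input.
recomputed = transform(partitions[1])
retry_run = [transform(partitions[0]), recomputed]

assert normal_run == retry_run  # same final result despite reprocessing
```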
However, when the transformed result is to be pushed to external systems using output operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times. Since this involves external systems, it is up to the system-specific code to handle this case. We can either use transactions to push to external systems (that is, atomically push one RDD partition at a time), or design updates to be idempotent operations (such that multiple runs of an update still produce the same result). For example, Spark Streaming's saveAs...File operations automatically make sure only one copy of each output file exists, by atomically moving a file to its final destination when it is complete.
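The atomic-rename trick used by the saveAs...File operations can be sketched in plain Python: write to a temporary file, then move it into place in a single step, so a retried task either finds the complete output already present or overwrites it with an identical copy. The function name below is hypothetical, not Spark's code:

```python
import os
import tempfile

def save_atomically(path, lines):
    # Write the output to a temporary file in the same directory...
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # ...then atomically move it to its final destination. Rerunning the
    # task after a failure is idempotent: at most one complete copy of
    # the file ever exists, and readers never see a partial file.
    os.replace(tmp, path)
```

Calling `save_atomically` twice with the same data leaves exactly the same single output file, which is what makes the at-least-once task execution harmless here.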
Streaming UI
Spark Streaming provides a special UI page that lets us look at what applications are doing. This is available in a Streaming tab on the normal Spark UI (typically http://<driver>:4040). A sample screenshot is shown in Figure 10-9.