Stateful transformations require checkpointing to be enabled in your StreamingContext
for fault tolerance. We will discuss checkpointing in more detail in “24/7 Operation”
on page 205, but for now, you can enable it by passing a directory to
ssc.checkpoint(), as shown in Example 10-16.
Example 10-16. Setting up checkpointing
ssc.checkpoint("hdfs://...")
For local development, you can also use a local path (e.g., /tmp) instead of HDFS.
Windowed transformations
Windowed operations compute results across a longer time period than the
StreamingContext's batch interval, by combining results from multiple batches. In this
section, we'll show how to use them to keep track of the most common response codes,
content sizes, and clients in a web server access log.
All windowed operations need two parameters, window duration and sliding duration,
both of which must be a multiple of the StreamingContext's batch interval. The
window duration controls how many previous batches of data are considered, namely
the last windowDuration / batchInterval batches. If we had a source DStream with a
batch interval of 10 seconds and wanted to create a sliding window of the last 30
seconds (or last 3 batches), we would set the windowDuration to 30 seconds. The sliding
duration, which defaults to the batch interval, controls how frequently the new DStream
computes results. If we had the source DStream with a batch interval of 10 seconds
and wanted to compute our window only on every second batch, we would set our
sliding duration to 20 seconds. Figure 10-6 shows an example.
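The arithmetic above can be checked with a small sketch in plain Python. It is not part of the Spark API: the helper name window_plan and its parameters are hypothetical, and durations are assumed to be whole seconds.

```python
def window_plan(batch_interval, window_duration, slide_duration=None):
    """Return (batches_per_window, batches_per_slide).

    Hypothetical helper for illustration only; all durations are in seconds.
    """
    # The sliding duration defaults to the batch interval.
    if slide_duration is None:
        slide_duration = batch_interval
    # Both durations must be positive multiples of the batch interval.
    for name, value in (("window", window_duration), ("slide", slide_duration)):
        if value <= 0 or value % batch_interval != 0:
            raise ValueError(
                f"{name} duration {value}s is not a multiple of "
                f"the batch interval {batch_interval}s"
            )
    return window_duration // batch_interval, slide_duration // batch_interval


# 10-second batches, 30-second window, results every 20 seconds:
plan = window_plan(10, 30, 20)
```

With the values from the text, each window covers 3 batches and a result is produced on every second batch. A window duration of 25 seconds would be rejected, since it is not a multiple of the 10-second batch interval.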
The simplest window operation we can do on a DStream is window(), which returns
a new DStream with the data for the requested window. In other words, each RDD in
the DStream resulting from window() will contain data from multiple batches, which
we can then process with count(), transform(), and so on. (See Examples 10-17 and
10-18.)
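To make the idea of "each RDD contains data from multiple batches" concrete, here is a minimal simulation in plain Python, with lists standing in for batches. It does not use Spark; the function name windowed, the sample response codes, and the handling of the first, partially filled windows are all assumptions made for illustration.

```python
def windowed(batches, window_batches, slide_batches=1):
    """Simulate window() over a list of batches (each batch is a list of records).

    Hypothetical stand-in for a DStream; returns one merged list per window.
    """
    out = []
    # Emit a result every `slide_batches` batches.
    for end in range(slide_batches, len(batches) + 1, slide_batches):
        # Look back at most `window_batches` batches (fewer near the start).
        start = max(0, end - window_batches)
        merged = [rec for batch in batches[start:end] for rec in batch]
        out.append(merged)
    return out


# Four 10-second batches of made-up response codes.
batches = [["200", "404"], ["200"], ["500", "200"], ["404"]]

# A 30-second window (3 batches) computed every 20 seconds (2 batches).
windows = windowed(batches, window_batches=3, slide_batches=2)

# Processing each window afterwards, analogous to calling count() on it.
counts = [len(w) for w in windows]
```

Here the second window merges the last three batches into a single collection, which is then counted as a whole, mirroring how a windowed DStream is processed after window() returns.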