Note
See the documentation on input sources at
http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams
for more details and for links to various advanced sources.
Transformations
As we saw in Chapter 1, Getting Up and Running with Spark, and throughout this topic,
Spark allows us to apply powerful transformations to RDDs. As DStreams are made up of
RDDs, Spark Streaming provides a set of transformations available on DStreams; these
transformations are similar to those available on RDDs. These include map, flatMap,
filter, join, and reduceByKey.
Spark Streaming transformations, like those applicable to RDDs, operate on each element
of a DStream's underlying data. That is, a transformation is effectively applied to each
RDD in the DStream, and each RDD, in turn, applies the transformation to its own
elements.
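The batch-by-batch behavior described above can be sketched in plain Python. This is only a toy model, not Spark code: a DStream is represented as a list of batches, each batch is an ordinary list standing in for an RDD, and the helper names dstream_map and dstream_flat_map are hypothetical and are not part of the Spark API.

```python
# Toy model (assumption): a "DStream" is a list of batches, and each batch
# is a plain list standing in for an RDD. No Spark is involved.

def dstream_map(batches, fn):
    """Apply fn to every element of every batch, keeping batch boundaries."""
    return [[fn(x) for x in batch] for batch in batches]

def dstream_flat_map(batches, fn):
    """Apply fn to every element and flatten the results within each batch."""
    return [[y for x in batch for y in fn(x)] for batch in batches]

# Three batch intervals of text lines; the second interval received no data.
line_batches = [["a b", "c"], [], ["d e f"]]

word_batches = dstream_flat_map(line_batches, str.split)
length_batches = dstream_map(word_batches, len)
```

Note that the transformation never mixes elements from different batches: the output has exactly one batch per input batch, including the empty one.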
Spark Streaming also provides operators such as reduce and count. These operators
return a DStream in which each batch holds a single element (for example, the count
value for that batch). Unlike the equivalent operators on RDDs, these do not trigger
computation on DStreams directly. That is, they are not actions; they are
transformations, as they return another DStream.
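To make the "single element per batch" idea concrete, here is a minimal plain-Python sketch using the same toy representation (a list of batches, each a plain list). The helpers dstream_count and dstream_reduce are hypothetical names for illustration only. Unlike Spark, this model evaluates eagerly; the point is only the shape of the result, which is again a stream rather than a single value returned to the driver.

```python
from functools import reduce

# Toy model (assumption): a "DStream" is a list of batches (plain lists).

def dstream_count(batches):
    """Return a new stream whose batches each hold one element: the count."""
    return [[len(batch)] for batch in batches]

def dstream_reduce(batches, fn):
    """Return a new stream whose batches each hold the reduced value.

    In this toy model an empty batch simply reduces to an empty batch.
    """
    return [[reduce(fn, batch)] if batch else [] for batch in batches]

value_batches = [[3, 1, 4], [], [1, 5]]

count_batches = dstream_count(value_batches)
sum_batches = dstream_reduce(value_batches, lambda a, b: a + b)
```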
Keeping track of state
When we were dealing with batch processing of RDDs, keeping and updating a state
variable was relatively straightforward. We could start with a certain state (for example, a
count or sum of values) and then use broadcast variables or accumulators to update this
state in parallel. Usually, we would then use an RDD action to collect the updated state to
the driver and, in turn, update the global state.
With DStreams, this is a little more complex, as we need to keep track of state across
batches in a fault-tolerant manner. Conveniently, Spark Streaming provides the
updateStateByKey function on a DStream of key-value pairs, which takes care of this
for us, allowing us to create a stream of arbitrary state information and update it with each
batch of data seen. For example, the state could be a global count of the number of times
each key has been seen. The state could thus represent the number of visits per web
page, clicks per advert, tweets per user, or purchases per product.
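The per-key running state described above can be modeled in a few lines of plain Python. This is a sketch of the idea only, not Spark code: update_state_by_key and update_count are hypothetical helpers, the stream is a list of batches of (key, value) pairs, and fault tolerance (which Spark handles through checkpointing) is left out entirely. The assumed contract is that the update function receives the new values for a key in the current batch plus the previous state (or None), and returns the new state, with None meaning the key is dropped.

```python
from collections import defaultdict

def update_count(new_values, running_count):
    """Add this batch's values for a key to its running total."""
    return sum(new_values) + (running_count or 0)

def update_state_by_key(batches, update_fn):
    """Toy model of per-key state carried across batches.

    Returns one state snapshot (a dict) per batch. Keys with existing state
    are updated even when the current batch has no new values for them.
    """
    state = {}
    snapshots = []
    for batch in batches:
        grouped = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        next_state = {}
        for key in set(state) | set(grouped):
            new_state = update_fn(grouped.get(key, []), state.get(key))
            if new_state is not None:  # None drops the key from the state
                next_state[key] = new_state
        state = next_state
        snapshots.append(dict(state))
    return snapshots

# Page-visit events arriving over three batch intervals (illustrative data).
visit_batches = [
    [("home", 1), ("about", 1), ("home", 1)],
    [("home", 1)],
    [("pricing", 1), ("about", 1)],
]
visit_counts = update_state_by_key(visit_batches, update_count)
```

Each snapshot holds the visits-per-page totals seen so far, so the last entry of visit_counts is the global count after all three batches.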