Spark Streaming - Learning Spark

Database Reference

In-Depth Information

Stateless Transformations

Stateless transformations, some of which are listed in Table 10-1 , are simple RDD

transformations being applied on every batch—that is, every RDD in a DStream. We

have already seen filter() in Figure 10-3 . Many of the RDD transformations dis‐

cussed in Chapters 3 and 4 are also available on DStreams. Note that key/value

DStream transformations like reduceByKey() are made available in Scala by import

StreamingContext._ . In Java, as with RDDs, it is necessary to create a JavaPairD

Stream using mapToPair() .

Table 10-1. Examples of stateless DStream transformations (incomplete list)

Function name

Purpose

Scala example

Signature of user-

supplied function on

DStream[T]

Apply a function to each

element in the DStream and

return a DStream of the result.

map()

ds.map(x => x + 1)

f: (T) → U

Apply a function to each

element in the DStream and

return a DStream of the

contents of the iterators

returned.

flatMap()

ds.flatMap(x => x.split(" "))

f: T → Itera

ble[U]

Return a DStream consisting of

only elements that pass the

condition passed to filter.

filter()

ds.filter(x => x != 1)

f: T → Boolean

Change the number of

partitions of the DStream.

N/A

reparti

tion()

ds.repartition(10)

Combine values with the same

key in each batch.

reduceBy

Key()

ds.reduceByKey(

(x, y) => x + y)

f: T, T → T

Group values with the same

key in each batch.

N/A

groupBy

Key()

ds.groupByKey()

Keep in mind that although these functions look like they're applying to the whole

stream, internally each DStream is composed of multiple RDDs (batches), and each

stateless transformation applies separately to each RDD. For example, reduceByKey()

will reduce data within each time step, but not across time steps. The stateful trans‐

formations we cover later allow combining data across time.

Learning Spark

Search WWH ::

Custom Search

Home