Working with Key/Value Pairs - Learning Spark


Example 4-6. Simple filter on second element in Java

    // Keep only the pairs whose value (the second element) is shorter than 20 characters.
    Function<Tuple2<String, String>, Boolean> longWordFilter =
      new Function<Tuple2<String, String>, Boolean>() {
        public Boolean call(Tuple2<String, String> keyValue) {
          return (keyValue._2().length() < 20);
        }
      };

    JavaPairRDD<String, String> result = pairs.filter(longWordFilter);

Figure 4-1. Filter on value

Sometimes working with pairs can be awkward if we want to access only the value part of our pair RDD. Since this is a common pattern, Spark provides the mapValues(func) function, which is the same as map { case (x, y) => (x, func(y)) }. We will use this function in many of our examples.
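To make the equivalence concrete, here is a minimal plain-Java sketch of what mapValues(func) does to each pair. It models the semantics with standard collections and is not Spark code; the class name MapValuesSketch, the helper methods, and the sample data are all hypothetical. On a real JavaPairRDD the corresponding call would be pairs.mapValues(func).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of mapValues semantics (an illustration, not Spark's implementation).
final class MapValuesSketch {

    // Applies func to the value of every pair and leaves the key untouched,
    // mirroring map { case (x, y) => (x, func(y)) }.
    static <K, V, U> List<Map.Entry<K, U>> mapValues(
            List<Map.Entry<K, V>> pairs, Function<V, U> func) {
        List<Map.Entry<K, U>> out = new ArrayList<>();
        for (Map.Entry<K, V> pair : pairs) {
            out.add(new SimpleEntry<>(pair.getKey(), func.apply(pair.getValue())));
        }
        return out;
    }

    // Hypothetical sample data: map each string value to its length.
    static List<Map.Entry<String, Integer>> demo() {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("holden", "likes coffee"));
        pairs.add(new SimpleEntry<>("panda", "likes long strings and coffee"));
        return mapValues(pairs, String::length);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The keys pass through unchanged, so only the value type changes (here from String to Integer).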

We now discuss each of the families of pair RDD functions, starting with aggregations.

Aggregations

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key. We have looked at the fold(), combine(), and reduce() actions on basic RDDs. Pair RDDs have a similar set of per-key operations that combine the values sharing a key. These operations return RDDs and thus are transformations rather than actions.

reduceByKey() is quite similar to reduce(); both take a function and use it to combine values. reduceByKey() runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. Because datasets can have very large numbers of keys, reduceByKey() is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
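The per-key behavior described above can be sketched in plain Java. This is a single-machine model of the semantics only: it uses a standard Map instead of an RDD and does nothing in parallel. The class name ReduceByKeySketch, its methods, and the sample pairs are hypothetical. On a real JavaPairRDD the corresponding call would be pairs.reduceByKey(func).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java model of reduceByKey semantics (an illustration, not Spark's implementation).
final class ReduceByKeySketch {

    // Runs one independent reduce per key: values that share a key are
    // combined with func, and the result holds one entry per distinct key.
    static <K extends Comparable<K>, V> Map<K, V> reduceByKey(
            List<Map.Entry<K, V>> pairs, BinaryOperator<V> func) {
        Map<K, V> reduced = new TreeMap<>(); // sorted only to make the output deterministic
        for (Map.Entry<K, V> pair : pairs) {
            // The first value seen for a key is stored as-is; later ones are merged with func.
            reduced.merge(pair.getKey(), pair.getValue(), func);
        }
        return reduced;
    }

    // Hypothetical sample data: sum the values for each key.
    static Map<String, Integer> demo() {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("panda", 0));
        pairs.add(new SimpleEntry<>("pink", 3));
        pairs.add(new SimpleEntry<>("pirate", 3));
        pairs.add(new SimpleEntry<>("panda", 1));
        pairs.add(new SimpleEntry<>("pink", 4));
        return reduceByKey(pairs, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As in the text, the result is a collection with one entry per key rather than a single value, which is why the Spark operation is a transformation and not an action.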

foldByKey() is quite similar to fold(); both use a zero value of the same type as the data in our RDD and a combination function. As with fold(), the provided zero value
