Example 4-6. Simple filter on second element in Java
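import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// Keep only the pairs whose value (second element) is shorter than 20 characters.
// pairs is assumed to be an existing JavaPairRDD<String, String>.
Function<Tuple2<String, String>, Boolean> longWordFilter =
  new Function<Tuple2<String, String>, Boolean>() {
    public Boolean call(Tuple2<String, String> keyValue) {
      return (keyValue._2().length() < 20);
    }
  };
JavaPairRDD<String, String> result = pairs.filter(longWordFilter);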
Figure 4-1. Filter on value
Sometimes working with pairs can be awkward if we want to access only the value
part of our pair RDD. Since this is a common pattern, Spark provides the
mapValues(func) function, which is the same as map{case (x, y) => (x, func(y))}.
We will use this function in many of our examples.
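To make the equivalence concrete, here is a minimal plain-Java sketch that mimics what mapValues(func) does, using an in-memory list of pairs instead of an RDD. The class MapValuesSketch and its helper method are hypothetical and not part of Spark; in Spark's Java API the corresponding call is mapValues() on a JavaPairRDD.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: mimics mapValues(func) on a local list of pairs (no Spark involved).
public class MapValuesSketch {

    // Applies func to each value and leaves every key untouched,
    // so (x, y) becomes (x, func(y)).
    static <K, V, W> List<Map.Entry<K, W>> mapValues(
            List<Map.Entry<K, V>> pairs, Function<V, W> func) {
        List<Map.Entry<K, W>> result = new ArrayList<>();
        for (Map.Entry<K, V> pair : pairs) {
            result.add(new SimpleEntry<>(pair.getKey(), func.apply(pair.getValue())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("k1", "short value"));
        pairs.add(new SimpleEntry<>("k2", "a somewhat longer value"));

        // Replace each value with its length; the keys are unchanged.
        System.out.println(mapValues(pairs, String::length));
    }
}
```

The helper never reads or rewrites the key, which is the point of mapValues(): only the value side of each pair is transformed.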
We now discuss each of the families of pair RDD functions, starting with
aggregations.
Aggregations
When datasets are described in terms of key/value pairs, it is common to want to
aggregate statistics across all elements with the same key. We have looked at the
fold(), aggregate(), and reduce() actions on basic RDDs; pair RDDs have a similar
set of per-key operations that combine values sharing the same key. These
operations return RDDs and are therefore transformations rather than actions.
reduceByKey() is quite similar to reduce(); both take a function and use it to
combine values. reduceByKey() runs several parallel reduce operations, one for each key
in the dataset, where each operation combines values that have the same key. Because
datasets can have very large numbers of keys, reduceByKey() is not implemented as
an action that returns a value to the user program. Instead, it returns a new RDD
consisting of each key and the reduced value for that key.
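The per-key behavior described above can be sketched in plain Java. This is an illustration under stated assumptions, not Spark code. The class ReduceByKeySketch and its helper are hypothetical, the data is a local list rather than an RDD, and the result is an ordinary Map, whereas Spark's reduceByKey() returns a new pair RDD.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical helper: mimics reduceByKey(func) on a local list of pairs (no Spark involved).
public class ReduceByKeySketch {

    // Runs one independent reduce per key: the first value seen for a key is stored as is,
    // and each later value for that key is merged into the running result with func.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> func) {
        Map<K, V> reduced = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            reduced.merge(pair.getKey(), pair.getValue(), func);
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("a", 1));
        pairs.add(new SimpleEntry<>("b", 4));
        pairs.add(new SimpleEntry<>("a", 2));

        // Sum the values of each key independently of the other keys.
        System.out.println(reduceByKey(pairs, Integer::sum));
    }
}
```

As in the description, the output holds one entry per distinct key, so its size grows with the number of keys rather than collapsing to a single value.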
foldByKey() is quite similar to fold(); both use a zero value of the same type as the
data in our RDD and a combination function. As with fold(), the provided zero value
 