Database Reference
In-Depth Information
Example 3-32. reduce() in Python
sum = rdd . reduce ( lambda x , y : x + y )
Example 3-33. reduce() in Scala
val sum = rdd . reduce (( x , y ) => x + y )
Example 3-34. reduce() in Java
Integer sum = rdd . reduce ( new Function2 < Integer , Integer , Integer >() {
public Integer call ( Integer x , Integer y ) { return x + y ; }
});
Similar to reduce() is fold() , which also takes a function with the same signature as
needed for reduce() , but in addition takes a “zero value” to be used for the initial call
on each partition. The zero value you provide should be the identity element for your
operation; that is, applying it multiple times with your function should not change
the value (e.g., 0 for +, 1 for *, or an empty list for concatenation).
You can minimize object creation in fold() by modifying and
returning the first of the two parameters in place. However, you
should not modify the second parameter.
Both fold() and reduce() require that the return type of our result be the same type
as that of the elements in the RDD we are operating over. This works well for opera‐
tions like sum , but sometimes we want to return a different type. For example, when
computing a running average, we need to keep track of both the count so far and the
number of elements, which requires us to return a pair. We could work around this
by first using map() where we transform every element into the element and the
number 1, which is the type we want to return, so that the reduce() function can
work on pairs.
The aggregate() function frees us from the constraint of having the return be the
same type as the RDD we are working on. With aggregate() , like fold() , we supply
an initial zero value of the type we want to return. We then supply a function to com‐
bine the elements from our RDD with the accumulator. Finally, we need to supply a
second function to merge two accumulators, given that each node accumulates its
own results locally.
We can use aggregate() to compute the average of an RDD, avoiding a map() before
the fold() , as shown in Examples 3-35 through 3-37 .
Search WWH ::




Custom Search