Database Reference
In-Depth Information
Example 3-32. reduce() in Python
sum
=
rdd
.
reduce
(
lambda
x
,
y
:
x
+
y
)
Example 3-33. reduce() in Scala
val
sum
=
rdd
.
reduce
((
x
,
y
)
=>
x
+
y
)
Example 3-34. reduce() in Java
Integer
sum
=
rdd
.
reduce
(
new
Function2
<
Integer
,
Integer
,
Integer
>()
{
public
Integer
call
(
Integer
x
,
Integer
y
)
{
return
x
+
y
;
}
});
Similar to
reduce()
is
fold()
, which also takes a function with the same signature as
needed for
reduce()
, but in addition takes a “zero value” to be used for the initial call
on each partition. The zero value you provide should be the identity element for your
operation; that is, applying it multiple times with your function should not change
the value (e.g., 0 for +, 1 for *, or an empty list for concatenation).
You can minimize object creation in
fold()
by modifying and
returning the first of the two parameters in place. However, you
should not modify the second parameter.
Both
fold()
and
reduce()
require that the return type of our result be the same type
as that of the elements in the RDD we are operating over. This works well for opera‐
tions like
sum
, but sometimes we want to return a different type. For example, when
computing a running average, we need to keep track of both the count so far and the
number of elements, which requires us to return a pair. We could work around this
by first using
map()
where we transform every element into the element and the
number 1, which is the type we want to return, so that the
reduce()
function can
work on pairs.
The
aggregate()
function frees us from the constraint of having the return be the
same type as the RDD we are working on. With
aggregate()
, like
fold()
, we supply
an initial zero value of the type we want to return. We then supply a function to com‐
bine the elements from our RDD with the accumulator. Finally, we need to supply a
second function to merge two accumulators, given that each node accumulates its
own results locally.
We can use
aggregate()
to compute the average of an RDD, avoiding a
map()
before
the
fold()
, as shown in Examples
3-35
through
3-37
.