Function name                  Equivalent function*<A, B,…>           Usage
PairFlatMapFunction<T, K, V>   Function<T, Iterable<Tuple2<K, V>>>    PairRDD<K, V> from a flatMapToPair
PairFunction<T, K, V>          Function<T, Tuple2<K, V>>              PairRDD<K, V> from a mapToPair
We can modify Example 3-28, where we squared an RDD of numbers, to produce a
JavaDoubleRDD, as shown in Example 3-38. This gives us access to the additional
DoubleRDD-specific functions like mean() and variance().
Example 3-38. Creating DoubleRDD in Java
JavaDoubleRDD result = rdd.mapToDouble(
  new DoubleFunction<Integer>() {
    public double call(Integer x) {
      return (double) x * x;
    }
  });
System.out.println(result.mean());
The Python API is structured differently from the Java and Scala APIs. In Python,
all of the functions are implemented on the base RDD class, but they will fail at
runtime if the RDD holds data of the wrong type.
Persistence (Caching)
As discussed earlier, Spark RDDs are lazily evaluated, and sometimes we may wish to
use the same RDD multiple times. If we do this naively, Spark will recompute the
RDD and all of its dependencies each time we call an action on the RDD. This can be
especially expensive for iterative algorithms, which look at the data many times.
Another trivial example would be doing a count and then writing out the same RDD,
as shown in Example 3-39 .
Example 3-39. Double execution in Scala
val result = input.map(x => x * x)
println(result.count())
println(result.collect().mkString(","))
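The cost of this double execution can be illustrated without a cluster. The following is a minimal sketch in plain Java, not Spark code: it models a lazily evaluated dataset as a Supplier and counts how often the squaring step runs when two "actions" are called on it. The class LazyRecomputeSketch and the helpers squared, stored, runNaive, and runStored are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class LazyRecomputeSketch {
    // Counts how many times the squaring step actually runs.
    static int evaluations = 0;

    // A lazy "dataset": nothing is computed until get() is called,
    // and every call to get() repeats the work from scratch.
    static Supplier<List<Integer>> squared(List<Integer> input) {
        return () -> {
            List<Integer> out = new ArrayList<>();
            for (int x : input) {
                evaluations++;
                out.add(x * x);
            }
            return out;
        };
    }

    // Wraps a lazy dataset so the first result is kept and reused.
    static <T> Supplier<T> stored(Supplier<T> source) {
        return new Supplier<T>() {
            private T cached;
            private boolean computed = false;

            public T get() {
                if (!computed) {
                    cached = source.get();
                    computed = true;
                }
                return cached;
            }
        };
    }

    // Two "actions" on the plain lazy dataset: the work runs twice.
    static int runNaive() {
        evaluations = 0;
        Supplier<List<Integer>> result = squared(Arrays.asList(1, 2, 3, 4));
        result.get().size();      // stands in for count()
        result.get().toString();  // stands in for collect().mkString(",")
        return evaluations;
    }

    // The same two "actions" on the stored dataset: the work runs once.
    static int runStored() {
        evaluations = 0;
        Supplier<List<Integer>> result = stored(squared(Arrays.asList(1, 2, 3, 4)));
        result.get().size();
        result.get().toString();
        return evaluations;
    }

    public static void main(String[] args) {
        System.out.println("naive evaluations: " + runNaive());   // prints 8
        System.out.println("stored evaluations: " + runStored()); // prints 4
    }
}
```

With four input elements, the naive version squares eight values because each action recomputes the whole dataset, while the stored version squares only four. Spark's real behavior involves partitions, lineage, and cluster nodes, none of which this single-process model attempts to capture.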
To avoid computing an RDD multiple times, we can ask Spark to persist the data.
When we ask Spark to persist an RDD, the nodes that compute the RDD store their
