Function name                  Equivalent function*<A, B,…>           Usage
PairFlatMapFunction<T, K, V>   Function<T, Iterable<Tuple2<K, V>>>    PairRDD<K, V> from a flatMapToPair
PairFunction<T, K, V>          Function<T, Tuple2<K, V>>              PairRDD<K, V> from a mapToPair
We can modify Example 3-28, where we squared an RDD of numbers, to produce a
JavaDoubleRDD, as shown in Example 3-38. This gives us access to the additional
DoubleRDD-specific functions like mean() and variance().
Example 3-38. Creating DoubleRDD in Java
JavaDoubleRDD result = rdd.mapToDouble(
  new DoubleFunction<Integer>() {
    public double call(Integer x) {
      return (double) x * x;
    }
  });
System.out.println(result.mean());
Python
The Python API is structured differently from the Java and Scala APIs. In Python, all of
the functions are implemented on the base RDD class, but they will fail at runtime if the
type of data in the RDD is incorrect.
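To see why such errors surface only at runtime, here is a toy sketch that does not use Spark at all: the hypothetical MiniRDD class below mimics the shape of the Python API, where every method lives on the one RDD class regardless of element type.

```python
class MiniRDD:
    """Toy stand-in for PySpark's RDD (illustration only, not Spark code).

    Every operation is defined on this single class, so calling a
    numeric-only method on non-numeric data is not caught up front --
    it fails only when the computation actually runs.
    """
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        # Return a new MiniRDD with f applied to each element.
        return MiniRDD(f(x) for x in self.data)

    def mean(self):
        # Works for numeric elements; raises TypeError at runtime
        # if the elements cannot be summed.
        return sum(self.data) / len(self.data)

nums = MiniRDD([1, 2, 3, 4]).map(lambda x: x * x)
print(nums.mean())  # 7.5

words = MiniRDD(["a", "b"])
try:
    words.mean()  # mean() exists on every MiniRDD, but fails here
except TypeError:
    print("runtime failure: elements are not numeric")
```

In real PySpark the situation is analogous: mean() is available on any RDD, and a type mismatch is reported only when the job executes, not when the program is compiled or the RDD is defined.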
Persistence (Caching)
As discussed earlier, Spark RDDs are lazily evaluated, and sometimes we may wish to
use the same RDD multiple times. If we do this naively, Spark will recompute the
RDD and all of its dependencies each time we call an action on the RDD. This can be
especially expensive for iterative algorithms, which look at the data many times.
Another trivial example would be doing a count and then writing out the same RDD,
as shown in Example 3-39 .
Example 3-39. Double execution in Scala
val result = input.map(x => x * x)
println(result.count())
println(result.collect().mkString(","))
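The cost of the double execution above comes from re-running the map once per action. A plain-Python analogy (no Spark involved; the counter and function names are illustrative) makes the recomputation visible:

```python
# Count how many times the "transformation" actually runs.
calls = {"n": 0}

def square(x):
    calls["n"] += 1
    return x * x

data = [1, 2, 3]

# Each "action" naively re-applies the transformation to the data,
# just as Spark recomputes an unpersisted RDD for every action.
count = len([square(x) for x in data])       # first action
collected = [square(x) for x in data]        # second action recomputes

print(calls["n"])  # 6: the map ran twice over 3 elements
```

Persisting the intermediate result (in Spark, via persist() or cache()) would let the second action reuse the already-computed values instead of recomputing them.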
To avoid computing an RDD multiple times, we can ask Spark to persist the data.
When we ask Spark to persist an RDD, the nodes that compute the RDD store their