Scala
In Scala the conversion to RDDs with special functions (e.g., to expose numeric functions on an RDD[Double]) is handled automatically using implicit conversions. As mentioned in “Initializing a SparkContext” on page 17, we need to add import org.apache.spark.SparkContext._ for these conversions to work. These implicit conversions turn an RDD into various wrapper classes, such as DoubleRDDFunctions (for RDDs of numeric data) and PairRDDFunctions (for key/value pairs), to expose additional functions such as mean() and variance().
Implicits, while quite powerful, can sometimes be confusing. If you call a function such as mean() on an RDD and then look at the Scaladoc for the RDD class, you will find there is no mean() function. The call manages to succeed because of implicit conversions between RDD[Double] and DoubleRDDFunctions. When searching for functions on your RDD in Scaladoc, make sure to look at functions that are available in these wrapper classes.
Java
In Java the conversion between the specialized types of RDDs is a bit more explicit. In particular, there are special classes called JavaDoubleRDD and JavaPairRDD for RDDs of these types, with extra methods for these types of data. This has the benefit of giving you a greater understanding of what exactly is going on, but can be a bit more cumbersome.
To construct RDDs of these special types, instead of always using the Function class we will need to use specialized versions. If we want to create a DoubleRDD from an RDD of type T, rather than using Function<T, Double> we use DoubleFunction<T>. Table 3-5 shows the specialized functions and their uses.
We also need to call different functions on our RDD (so we can't just create a DoubleFunction and pass it to map()). When we want a DoubleRDD back, instead of calling map(), we need to call mapToDouble(), with the same pattern all of the other functions follow.
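The reason a separate function type and a separate method are both needed can be sketched in plain Java, without Spark on the classpath. The classes below are toy stand-ins, not Spark's API: MiniRDD, MiniDoubleRDD, and ToDoubleFn are hypothetical names that only mirror the shape of JavaRDD, JavaDoubleRDD, and DoubleFunction<T> as described above. The point is that map() can only return the general type, so the numeric methods become reachable only through a method whose return type is the specialized class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy stand-ins (hypothetical, NOT Spark classes) that mirror the shape of
// JavaRDD, JavaDoubleRDD, and DoubleFunction<T>.
public class SpecializedRddSketch {

    // Plays the role of DoubleFunction<T>: returns a primitive double.
    interface ToDoubleFn<T> {
        double call(T value);
    }

    // Plays the role of JavaDoubleRDD: carries the numeric-only methods.
    static final class MiniDoubleRDD {
        private final double[] values;

        MiniDoubleRDD(double[] values) { this.values = values; }

        double mean() {
            double sum = 0.0;
            for (double v : values) sum += v;
            return sum / values.length;
        }

        // Population variance: mean of squared deviations from the mean.
        double variance() {
            double m = mean();
            double acc = 0.0;
            for (double v : values) acc += (v - m) * (v - m);
            return acc / values.length;
        }
    }

    // Plays the role of JavaRDD<T>.
    static final class MiniRDD<T> {
        private final List<T> items;

        MiniRDD(List<T> items) { this.items = items; }

        // Generic map: the result is another MiniRDD, so mean() is not available.
        <R> MiniRDD<R> map(Function<T, R> f) {
            List<R> out = new ArrayList<>();
            for (T item : items) out.add(f.apply(item));
            return new MiniRDD<>(out);
        }

        // Specialized map: takes the specialized function type and
        // returns the specialized collection type.
        MiniDoubleRDD mapToDouble(ToDoubleFn<T> f) {
            double[] out = new double[items.size()];
            for (int i = 0; i < out.length; i++) out[i] = f.call(items.get(i));
            return new MiniDoubleRDD(out);
        }
    }

    // Squares each element and returns the mean of the squares.
    static double meanOfSquares(List<Integer> input) {
        MiniRDD<Integer> rdd = new MiniRDD<>(input);
        MiniDoubleRDD squares = rdd.mapToDouble(x -> (double) (x * x));
        return squares.mean();
    }

    public static void main(String[] args) {
        System.out.println(meanOfSquares(Arrays.asList(1, 2, 3, 4)));
    }
}
```

In this sketch, rdd.map(...) would compile but give back a MiniRDD with no mean(), which is the same situation the text describes for map() versus mapToDouble(). With Spark itself, the corresponding call is mapToDouble() with a DoubleFunction<T>.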
Table 3-5. Java interfaces for type-specific functions
Function name              Equivalent function*<A, B,…>     Usage
DoubleFlatMapFunction<T>   Function<T, Iterable<Double>>    DoubleRDD from a flatMapToDouble
DoubleFunction<T>          Function<T, double>              DoubleRDD from mapToDouble