The first step to a Spark program in Java
The Java API is very similar in principle to the Scala API. However, while Scala can call
Java code quite easily, it is not always possible to call Scala code from Java. This is
particularly the case when the Scala code makes use of certain Scala features, such as
implicit conversions, default parameters, and the Scala reflection API.
Spark makes heavy use of these features in general, so it is necessary to have a separate
API specifically for Java that includes Java versions of the common classes. Hence,
SparkContext becomes JavaSparkContext, and RDD becomes JavaRDD.
Java versions prior to version 8 do not support anonymous functions and do not have succinct
syntax for functional-style programming, so functions in the Spark Java API must implement a
WrappedFunction interface with the call method signature. While it is significantly more
verbose, we will often create one-off anonymous classes to pass to our Spark operations, which
implement this interface and the call method, to achieve much the same effect as anonymous
functions in Scala.
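The anonymous-class pattern described above can be sketched in plain Java without Spark on the classpath. This is only an illustration under stated assumptions: the Function interface and the map helper below are hypothetical stand-ins defined for this sketch (they are not Spark's own types), and the list plays the role that an RDD would play in a real program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnonymousClassSketch {

    // Hypothetical stand-in for a Spark-style function interface:
    // one abstract method named call. This is NOT Spark's own type.
    interface Function<T, R> {
        R call(T input);
    }

    // Hypothetical helper that plays the role of a map operation,
    // applied to an ordinary in-memory list instead of an RDD.
    static <T, R> List<R> map(List<T> items, Function<T, R> f) {
        List<R> out = new ArrayList<R>();
        for (T item : items) {
            out.add(f.call(item));
        }
        return out;
    }

    static List<Integer> lineLengths(List<String> lines) {
        // One-off anonymous class: the pre-Java 8 way to pass behaviour
        // to an operation that expects a function object.
        return map(lines, new Function<String, Integer>() {
            @Override
            public Integer call(String line) {
                return line.length();
            }
        });
    }

    public static void main(String[] args) {
        // prints [5, 4, 3]
        System.out.println(lineLengths(Arrays.asList("spark", "java", "rdd")));
    }
}
```

The six lines of the anonymous class carry a single expression of real logic, which is the verbosity the text refers to.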
Spark provides support for Java 8's anonymous function (or lambda) syntax. Using this
syntax makes a Spark program written in Java 8 look very close to the equivalent Scala
program.
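As a rough illustration of how the Java 8 syntax shortens such code, the following plain-Java sketch passes a lambda and a method reference to an operation that expects a single-method interface. The Function interface and map helper are hypothetical stand-ins written for this sketch, not part of Spark; with Spark itself, the lambda would be passed to an RDD operation in the same position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LambdaSketch {

    // Hypothetical single-method interface standing in for a Spark function type.
    interface Function<T, R> {
        R call(T input);
    }

    // Hypothetical map helper over a plain list (not an RDD).
    static <T, R> List<R> map(List<T> items, Function<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            out.add(f.call(item));
        }
        return out;
    }

    static List<Integer> lineLengths(List<String> lines) {
        // A lambda is accepted wherever an interface with exactly one
        // abstract method is expected, so no anonymous class is needed.
        return map(lines, line -> line.length());
    }

    static List<String> upperCased(List<String> lines) {
        // Method reference: shorthand for line -> line.toUpperCase()
        return map(lines, String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark", "java");
        System.out.println(lineLengths(lines)); // prints [5, 4]
        System.out.println(upperCased(lines));  // prints [SPARK, JAVA]
    }
}
```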
In Scala, an RDD of key/value pairs provides special operators (such as reduceByKey and
saveAsSequenceFile, for example) that are accessed automatically via implicit conversions.
In Java, special types of JavaRDD classes are required in order to access similar functions.
These include JavaPairRDD to work with key/value pairs and JavaDoubleRDD to work with
numerical records.
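Why Java needs a dedicated pair type can be sketched in plain Java. Without implicit conversions, key-based operations such as reduceByKey have to live on a separate class, and the conversion to that class is an explicit step. The Records and PairRecords classes below are hypothetical stand-ins for JavaRDD and JavaPairRDD, built on ordinary lists purely for illustration; they are not Spark's implementation, and the method names are only chosen to echo the Spark operations.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class PairTypeSketch {

    // Hypothetical stand-in for JavaRDD<T>: only generic operations live here.
    static class Records<T> {
        final List<T> items;

        Records(List<T> items) {
            this.items = items;
        }

        // Moving to the pair type is an explicit call in Java.
        <K, V> PairRecords<K, V> mapToPair(Function<T, Map.Entry<K, V>> f) {
            List<Map.Entry<K, V>> out = new ArrayList<>();
            for (T item : items) {
                out.add(f.apply(item));
            }
            return new PairRecords<>(out);
        }
    }

    // Hypothetical stand-in for JavaPairRDD<K, V>: key-based operations live here.
    static class PairRecords<K, V> {
        final List<Map.Entry<K, V>> pairs;

        PairRecords(List<Map.Entry<K, V>> pairs) {
            this.pairs = pairs;
        }

        // Combines all values that share a key, keeping first-seen key order.
        Map<K, V> reduceByKey(BinaryOperator<V> combine) {
            Map<K, V> result = new LinkedHashMap<>();
            for (Map.Entry<K, V> p : pairs) {
                result.merge(p.getKey(), p.getValue(), combine);
            }
            return result;
        }
    }

    static Map<String, Integer> countWords(List<String> words) {
        return new Records<>(words)
                .mapToPair(w -> new SimpleEntry<String, Integer>(w, 1))
                .reduceByKey(Integer::sum);
    }

    public static void main(String[] args) {
        // prints {a=2, b=1}
        System.out.println(countWords(Arrays.asList("a", "b", "a")));
    }
}
```

Note that reduceByKey is simply not available on Records, which mirrors the reason the Java API exposes separate classes for key/value and numerical data.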
Tip
In this section, we covered the standard Java API syntax. For more details and examples
related to working with RDDs in Java, as well as the Java 8 lambda syntax, see the Java
sections of the Spark Programming Guide found at
http://spark.apache.org/docs/latest/programming-
We will see examples of most of these differences in the following Java program, which is
included in the example code of this chapter in the directory named java-spark-app. The code
directory also contains the CSV data file under the data subdirectory.