Getting Up and Running with Spark - Machine Learning with Spark

The first step to a Spark program in Java

The Java API is very similar in principle to the Scala API. However, while Scala can call Java code quite easily, in some cases it is not possible to call Scala code from Java. This is particularly the case when the Scala code makes use of certain Scala features, such as implicit conversions, default parameters, and the Scala reflection API. Spark makes heavy use of these features, so it is necessary to have a separate API specifically for Java that includes Java versions of the common classes. Hence, SparkContext becomes JavaSparkContext, and RDD becomes JavaRDD.

Java versions prior to version 8 do not support anonymous functions and do not have succinct syntax for functional-style programming, so functions in the Spark Java API must implement one of Spark's function interfaces (such as Function, Function2, or PairFunction in the org.apache.spark.api.java.function package), each of which declares a call method. Although this is significantly more verbose, we will often create one-off anonymous classes that implement such an interface and its call method and pass them to our Spark operations, achieving much the same effect as anonymous functions in Scala.

Spark provides support for Java 8's anonymous function (or lambda) syntax. Using this syntax makes a Spark program written in Java 8 look very close to the equivalent Scala program.
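The difference between the two styles can be sketched without Spark on the classpath. The following is a minimal illustration, not Spark code: the nested Function interface is a hypothetical stand-in that only mirrors the shape of Spark's org.apache.spark.api.java.function.Function (a single call method), and the map helper is a hypothetical substitute for an RDD's map operation that works on a plain list. The class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FunctionStyles {

    // Hypothetical stand-in with the same shape as Spark's
    // org.apache.spark.api.java.function.Function: one method named call.
    public interface Function<T, R> {
        R call(T value) throws Exception;
    }

    // Hypothetical helper that plays the role of an RDD's map operation,
    // applied here to an ordinary in-memory list.
    public static <T, R> List<R> map(List<T> input, Function<T, R> f) throws Exception {
        List<R> output = new ArrayList<>();
        for (T item : input) {
            output.add(f.call(item));
        }
        return output;
    }

    // Pre-Java 8 style: a one-off anonymous class that implements call.
    public static List<Integer> lengthsWithAnonymousClass(List<String> words) throws Exception {
        return map(words, new Function<String, Integer>() {
            @Override
            public Integer call(String word) {
                return word.length();
            }
        });
    }

    // Java 8 style: a lambda supplies the body of call directly.
    public static List<Integer> lengthsWithLambda(List<String> words) throws Exception {
        return map(words, word -> word.length());
    }

    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("spark", "java", "rdd");
        System.out.println(lengthsWithAnonymousClass(words)); // prints [5, 4, 3]
        System.out.println(lengthsWithLambda(words));         // prints [5, 4, 3]
    }
}
```

Both methods compute the same result. The lambda version works because the interface declares exactly one abstract method, which is also why Spark's own single-method function interfaces can be written as lambdas in Java 8.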

In Scala, an RDD of key/value pairs provides special operators (such as reduceByKey and saveAsSequenceFile) that are accessed automatically via implicit conversions. In Java, special types of JavaRDD classes are required in order to access similar functions. These include JavaPairRDD to work with key/value pairs and JavaDoubleRDD to work with numerical records.

Tip

In this section, we covered the standard Java API syntax. For more details and examples related to working with RDDs in Java, as well as the Java 8 lambda syntax, see the Java sections of the Spark Programming Guide at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.

We will see examples of most of these differences in the following Java program, which is included in the example code of this chapter in the directory named java-spark-app. The code directory also contains the CSV data file under the data subdirectory.
