      }
    );
    maxTemps.saveAsTextFile(args[1]);
  }
}
In Spark's Java API, an RDD is represented by an instance of JavaRDD, or JavaPairRDD for the special case of an RDD of key-value pairs. Both of these classes implement the JavaRDDLike interface, which is where most of the methods for working with RDDs are found (when consulting the class documentation, for example).
Running the program is identical to running the Scala version, except that the class name is MaxTemperatureSpark.
A Python Example
Spark also has language support for Python, in an API called PySpark. By taking advantage of Python's lambda expressions, we can rewrite the example program in a way that closely mirrors the Scala equivalent, as shown in Example 19-3.
Example 19-3. Python application to find the maximum temperature, using PySpark
from pyspark import SparkContext
import re, sys

sc = SparkContext("local", "Max Temperature")
sc.textFile(sys.argv[1]) \
  .map(lambda s: s.split("\t")) \
  .filter(lambda rec: (rec[1] != "9999" and re.match("[01459]", rec[2]))) \
  .map(lambda rec: (int(rec[0]), int(rec[1]))) \
  .reduceByKey(max) \
  .saveAsTextFile(sys.argv[2])
Notice that for the reduceByKey() transformation we can use Python's built-in max
function.
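Because reduceByKey() repeatedly applies a two-argument function to the values that share a key, any commutative and associative function of two values works here. As a minimal sketch of the semantics (assuming a SparkContext named sc, as in Example 19-3, and some made-up (year, temperature) pairs):

pairs = sc.parallelize([(1949, 111), (1950, 22), (1950, 0)])
pairs.reduceByKey(max).collect()
# returns [(1949, 111), (1950, 22)]; max(22, 0) picks 22 for key 1950
# (the ordering of the collected pairs is not guaranteed)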
The important thing to note is that this program is written in regular CPython. Spark forks Python subprocesses to run the user's Python code (both in the launcher program and on executors that run user tasks in the cluster), and uses a socket to connect the two processes so the parent can pass RDD partition data to be processed by the Python code.
To run, we specify the Python file rather than the application JAR:
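For example, assuming the script has been saved as a file called MaxTemperature.py (the script name and the input and output paths in this sketch are placeholders):

% spark-submit --master local \
  MaxTemperature.py input/sample.txt output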