    });
    maxTemps.saveAsTextFile(args[1]);
  }
}
In Spark's Java API, an RDD is represented by an instance of JavaRDD, or JavaPairRDD for the special case of an RDD of key-value pairs. Both of these classes implement the JavaRDDLike interface, where most of the methods for working with RDDs can be found (when viewing class documentation, for example).
Running the program is identical to running the Scala version, except that the classname is MaxTemperatureSpark.
A Python Example
Spark also has language support for Python, in an API called PySpark. By taking advantage of Python's lambda expressions, we can rewrite the example program in a way that closely mirrors the Scala equivalent, as shown in Example 19-3.
Example 19-3. Python application to find the maximum temperature, using PySpark
from pyspark import SparkContext
import re, sys

sc = SparkContext("local", "Max Temperature")
sc.textFile(sys.argv[1]) \
  .map(lambda s: s.split("\t")) \
  .filter(lambda rec: (rec[1] != "9999" and re.match("[01459]", rec[2]))) \
  .map(lambda rec: (int(rec[0]), int(rec[1]))) \
  .reduceByKey(max) \
  .saveAsTextFile(sys.argv[2])
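The per-record logic of this pipeline can be traced in plain CPython without a Spark cluster. The following is a minimal sketch, using hypothetical sample records in the tab-separated layout (year, temperature, quality code) that the example assumes; the field positions and quality-code convention mirror the PySpark job above.

```python
import re

# Hypothetical sample input lines: year, temperature, quality code,
# separated by tabs, in the layout the PySpark example expects.
lines = [
    "1950\t0\t1",
    "1950\t22\t1",
    "1950\t-11\t1",
    "1949\t111\t1",
    "1949\t9999\t9",  # sentinel temperature: filtered out
]

# map: split each line into fields
records = [s.split("\t") for s in lines]
# filter: drop missing temperatures and bad quality codes
valid = [r for r in records if r[1] != "9999" and re.match("[01459]", r[2])]
# map: build (year, temperature) key-value pairs
pairs = [(int(r[0]), int(r[1])) for r in valid]

# reduceByKey(max): combine the values for each key with max
max_temps = {}
for year, temp in pairs:
    max_temps[year] = temp if year not in max_temps else max(max_temps[year], temp)

print(sorted(max_temps.items()))  # [(1949, 111), (1950, 22)]
```

Each list comprehension corresponds to one transformation in the chain; the final dictionary plays the role of the reduced RDD of (year, maximum temperature) pairs.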
Notice that for the reduceByKey() transformation we can use Python's built-in max function.
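This works because reduceByKey() combines the values for a key two at a time, so the function it is given must be associative and commutative. The two-argument form of Python's max has both properties, as this small sketch illustrates:

```python
from functools import reduce

# Values for one key; the order and grouping of pairwise max calls
# do not change the result, which is what reduceByKey relies on.
temps = [0, 22, -11, 5]
print(reduce(max, temps))  # 22
assert max(max(0, 22), max(-11, 5)) == reduce(max, temps)
```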
The important thing to note is that this program is written in regular CPython. Spark will fork Python subprocesses to run the user's Python code (both in the launcher program and on executors that run user tasks in the cluster), and uses a socket to connect the two processes so the parent can pass RDD partition data to be processed by the Python code.
To run, we specify the Python file rather than the application JAR: