    });
    maxTemps.saveAsTextFile(args[1]);
  }
}
In Spark's Java API, an RDD is represented by an instance of JavaRDD, or JavaPairRDD for the special case of an RDD of key-value pairs. Both of these classes implement the JavaRDDLike interface, where most of the methods for working with RDDs can be found (when viewing class documentation, for example).
Running the program is identical to running the Scala version, except that the classname is MaxTemperatureSpark.
A Python Example
Spark also has language support for Python, in an API called PySpark. By taking advantage of Python's lambda expressions, we can rewrite the example program in a way that closely mirrors the Scala equivalent, as shown in Example 19-3.
Example 19-3. Python application to find the maximum temperature, using PySpark
from pyspark import SparkContext
import re, sys

sc = SparkContext("local", "Max Temperature")
sc.textFile(sys.argv[1]) \
  .map(lambda s: s.split("\t")) \
  .filter(lambda rec: (rec[1] != "9999" and re.match("[01459]", rec[2]))) \
  .map(lambda rec: (int(rec[0]), int(rec[1]))) \
  .reduceByKey(max) \
  .saveAsTextFile(sys.argv[2])
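The per-record logic of this pipeline can be traced in plain CPython without a Spark cluster. The following is a minimal sketch, using hypothetical sample records in the tab-separated layout (year, temperature, quality code) that the example assumes; the field positions and quality-code convention mirror the PySpark job above.

```python
import re

# Hypothetical sample input lines: year, temperature, quality code,
# separated by tabs, in the layout the PySpark example expects.
lines = [
    "1950\t0\t1",
    "1950\t22\t1",
    "1950\t-11\t1",
    "1949\t111\t1",
    "1949\t9999\t9",  # sentinel temperature: filtered out
]

# map: split each line into fields
records = [s.split("\t") for s in lines]
# filter: drop missing temperatures and bad quality codes
valid = [r for r in records if r[1] != "9999" and re.match("[01459]", r[2])]
# map: build (year, temperature) key-value pairs
pairs = [(int(r[0]), int(r[1])) for r in valid]

# reduceByKey(max): combine the values for each key with max
max_temps = {}
for year, temp in pairs:
    max_temps[year] = temp if year not in max_temps else max(max_temps[year], temp)

print(sorted(max_temps.items()))  # [(1949, 111), (1950, 22)]
```

Each list comprehension corresponds to one transformation in the chain; the final dictionary plays the role of the reduced RDD of (year, maximum temperature) pairs.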
Notice that for the reduceByKey() transformation we can use Python's built-in max function.
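This works because reduceByKey() combines the values for a key two at a time, so the function it is given must be associative and commutative. The two-argument form of Python's max has both properties, as this small sketch illustrates:

```python
from functools import reduce

# Values for one key; the order and grouping of pairwise max calls
# do not change the result, which is what reduceByKey relies on.
temps = [0, 22, -11, 5]
print(reduce(max, temps))  # 22
assert max(max(0, 22), max(-11, 5)) == reduce(max, temps)
```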
The important thing to note is that this program is written in regular CPython. Spark will fork Python subprocesses to run the user's Python code (both in the launcher program and on executors that run user tasks in the cluster), and uses a socket to connect the two processes so the parent can pass RDD partition data to be processed by the Python code.
To run, we specify the Python file rather than the application JAR: