The first step to a Spark program in Python
Spark's Python API exposes virtually all the functionality of Spark's Scala API to the
Python language. A few features are not yet supported (for example, graph processing
with GraphX and some individual API methods). See the Python section of the Spark
Programming Guide ( http://spark.apache.org/docs/latest/programming-guide.html ) for
more details.
Following on from the preceding examples, we will now write a Python version. We
assume that you have Python version 2.6 or higher installed on your system (most Linux
and Mac OS X systems, for example, come with Python preinstalled).
The example program is included in the sample code for this chapter, in the directory
named python-spark-app , which also contains the CSV data file under the data sub-
directory. The project contains a script, pythonapp.py , provided here:
"""A simple Spark app in Python"""
from pyspark import SparkContext
sc = SparkContext("local[2]", "First Spark App")
# we take the raw data in CSV format and convert it into a
set of records of the form (user, product, price)
data = sc.textFile("data/
UserPurchaseHistory.csv").map(lambda line:
line.split(",")).map(lambda record: (record[0], record[1],
record[2]))
# let's count the number of purchases
numPurchases = data.count()
# let's count how many unique users made purchases
uniqueUsers = data.map(lambda record:
record[0]).distinct().count()
# let's sum up our total revenue
totalRevenue = data.map(lambda record:
float(record[2])).sum()
# let's find our most popular product
products = data.map(lambda record: (record[1],
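The aggregations in the script can be sanity-checked without a Spark installation by mirroring them in plain Python over a few in-memory records. The sample rows below are hypothetical stand-ins for the CSV contents, but they have the same (user, product, price) shape, so the counting, distinct, sum, and most-popular logic carries over directly:

```python
from collections import Counter

# hypothetical sample records in the same (user, product, price) shape
records = [
    ("john", "iphone-cover", "9.99"),
    ("john", "headphones", "5.49"),
    ("jack", "iphone-cover", "9.99"),
    ("jill", "samsung-galaxy-cover", "8.95"),
    ("bob", "iphone-cover", "9.99"),
]

# total number of purchases (one record per purchase)
num_purchases = len(records)
# distinct users who made at least one purchase
unique_users = len({user for user, _, _ in records})
# total revenue: prices are strings in the CSV, so convert to float
total_revenue = sum(float(price) for _, _, price in records)
# most popular product: count purchases per product, take the top one
most_popular = Counter(product for _, product, _ in records).most_common(1)[0]

print(num_purchases, unique_users, round(total_revenue, 2), most_popular)
```

Running this prints `5 4 44.41 ('iphone-cover', 3)`, matching what the Spark version computes over the same rows, only distributed across the cluster instead of a local list.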