The first step to a Spark program in Python
Spark's Python API exposes virtually all the functionality of Spark's Scala API to the
Python language. A few features are not yet supported (for example, graph processing
with GraphX and some individual API methods). See the Python section of the Spark
Programming Guide ( http://spark.apache.org/docs/latest/programming-guide.html ) for
more details.
Following on from the preceding examples, we will now write a Python version. We
assume that you have Python version 2.6 or higher installed on your system (most Linux
and Mac OS X systems, for example, come with Python preinstalled).
The example program is included in the sample code for this chapter, in the directory
named python-spark-app , which also contains the CSV data file under the data sub-
directory. The project contains a script, pythonapp.py , provided here:
"""A simple Spark app in Python"""
from pyspark import SparkContext
sc = SparkContext("local[2]", "First Spark App")
# we take the raw data in CSV format and convert it into a
set of records of the form (user, product, price)
data = sc.textFile("data/
UserPurchaseHistory.csv").map(lambda line:
line.split(",")).map(lambda record: (record[0], record[1],
record[2]))
# let's count the number of purchases
numPurchases = data.count()
# let's count how many unique users made purchases
uniqueUsers = data.map(lambda record:
record[0]).distinct().count()
# let's sum up our total revenue
totalRevenue = data.map(lambda record:
float(record[2])).sum()
# let's find our most popular product
products = data.map(lambda record: (record[1],
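The aggregations in the script can be sanity-checked without a Spark installation by mirroring them in plain Python over a few in-memory records. The sample rows below are hypothetical stand-ins for the CSV contents, but they have the same (user, product, price) shape, so the counting, distinct, sum, and most-popular logic carries over directly:

```python
from collections import Counter

# hypothetical sample records in the same (user, product, price) shape
records = [
    ("john", "iphone-cover", "9.99"),
    ("john", "headphones", "5.49"),
    ("jack", "iphone-cover", "9.99"),
    ("jill", "samsung-galaxy-cover", "8.95"),
    ("bob", "iphone-cover", "9.99"),
]

# total number of purchases (one record per purchase)
num_purchases = len(records)
# distinct users who made at least one purchase
unique_users = len({user for user, _, _ in records})
# total revenue: prices are strings in the CSV, so convert to float
total_revenue = sum(float(price) for _, _, price in records)
# most popular product: count purchases per product, take the top one
most_popular = Counter(product for _, product, _ in records).most_common(1)[0]

print(num_purchases, unique_users, round(total_revenue, 2), most_popular)
```

Running this prints `5 4 44.41 ('iphone-cover', 3)`, matching what the Spark version computes over the same rows, only distributed across the cluster instead of a local list.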