# count purchases per product and find the most popular one
products = data.map(lambda record: (record[1], 1.0)).reduceByKey(lambda a, b: a + b).collect()
mostPopular = sorted(products, key=lambda x: x[1], reverse=True)[0]
print "Total purchases: %d" % numPurchases
print "Unique users: %d" % uniqueUsers
print "Total revenue: %2.2f" % totalRevenue
print "Most popular product: %s with %d purchases" % (mostPopular[0], mostPopular[1])
If you compare the Scala and Python versions of our program, you will see that the syntax looks very similar in general. One key difference is how we express anonymous functions (also called lambda functions; hence the lambda keyword in the Python syntax). In Scala, we've seen that an anonymous function mapping an input x to an output y is expressed as x => y, while in Python it is lambda x: y. In the reduceByKey line in the preceding code, we apply an anonymous function that maps two inputs, a and b, generally of the same type, to an output. In this case, the function we apply is addition; hence lambda a, b: a + b.
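As a minimal standalone sketch of this two-argument lambda form (the app name and RDD contents below are illustrative, not taken from the sample project), you can see how reduceByKey merges the values for each key:

from pyspark import SparkContext

sc = SparkContext("local[2]", "Lambda Example")
pairs = sc.parallelize([("iPhone Cover", 1.0), ("Headphones", 1.0),
    ("iPhone Cover", 1.0)])
# reduceByKey applies the lambda pairwise to values sharing a key, so
# the two ("iPhone Cover", 1.0) records become ("iPhone Cover", 2.0)
counts = pairs.reduceByKey(lambda a, b: a + b).collect()
print counts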
The best way to run the script is to execute the following command from the base directory of the sample project:
>$SPARK_HOME/bin/spark-submit pythonapp.py
Here, the SPARK_HOME variable should be replaced with the path of the directory in
which you originally unpacked the Spark prebuilt binary package at the start of this
chapter.
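For example, assuming Spark was unpacked to /path/to/spark (an illustrative location; substitute your own), you could set the variable and run the app as follows:
>export SPARK_HOME=/path/to/spark
>$SPARK_HOME/bin/spark-submit pythonapp.py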
Upon running the script, you should see output similar to that of the Scala and Java examples, with the results of our computation being the same:
...
14/01/30 11:43:47 INFO SparkContext: Job finished: collect at pythonapp.py:14, took 0.050251 s
Total purchases: 5
Unique users: 4
Total revenue: 39.91
Most popular product: iPhone Cover with 2 purchases