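This section assumes that data is an RDD of (user, product, price) string tuples built earlier in the program, and that numPurchases holds the total record count referenced in the println statements later on. A minimal sketch of that setup might look as follows (the file name, master URL, and record layout here are illustrative assumptions, not taken from this excerpt):

import org.apache.spark.SparkContext

// assumed setup: a local SparkContext and an RDD of purchase records
val sc = new SparkContext("local[2]", "First Spark App")
val data = sc.textFile("data/UserPurchaseHistory.csv")
  .map(line => line.split(","))
  .map(record => (record(0), record(1), record(2)))
// total number of purchases, referenced by the println statements below
val numPurchases = data.count()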
// let's count how many unique users made purchases
val uniqueUsers = data.map{ case (user, product, price) => user }.distinct().count()
// let's sum up our total revenue
val totalRevenue = data.map{ case (user, product, price) => price.toDouble }.sum()
// let's find our most popular product
val productsByPopularity = data
  .map{ case (user, product, price) => (product, 1) }
  .reduceByKey(_ + _)
  .collect()
  .sortBy(-_._2)
val mostPopular = productsByPopularity(0)
This last piece of code, which computes the most popular product, is an example of the Map/Reduce pattern made popular by Hadoop. First, we mapped our records of (user, product, price) to records of (product, 1). Then, we performed a reduceByKey operation, where we summed up the 1s for each unique product.
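As an aside, Spark's RDD API also offers countByValue, which collapses this map-and-reduce pair into a single call that returns the per-product counts directly to the driver; a minimal equivalent sketch (the value name purchasesPerProduct is ours) would be:

// count occurrences of each product; returns a local Map[String, Long]
val purchasesPerProduct = data
  .map{ case (user, product, price) => product }
  .countByValue()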
Once we have this transformed RDD, which contains the number of purchases for each product, we call collect, which returns the results of the computation to the driver program as a local Scala collection. We then sort these counts locally (note that, in practice, if the amount of data is large, we would perform the sorting in parallel, usually with a Spark operation such as sortByKey, as sketched below).
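For example, a sketch of that parallel variant (the value name mostPopularParallel is ours) could swap each pair so that the count becomes the key, sort in a distributed fashion, and fetch only the top record:

// distributed alternative: sort by count across the cluster,
// then bring back only the single most popular (count, product) pair
val mostPopularParallel = data
  .map{ case (user, product, price) => (product, 1) }
  .reduceByKey(_ + _)
  .map{ case (product, count) => (count, product) } // make the count the key
  .sortByKey(ascending = false)
  .first()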
Finally, we will print out the results of our computations to the console:
println("Total purchases: " + numPurchases)
println("Unique users: " + uniqueUsers)
println("Total revenue: " + totalRevenue)
println("Most popular product: %s with %d
purchases".format(mostPopular._1, mostPopular._2))
}
}
We can run this program with sbt run in the project's base directory, or by running it in your Scala IDE if you are using one. Either way, the program will print the total number of purchases, the number of unique users, the total revenue, and the most popular product to the console.
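Running it with sbt requires the Spark core library on the classpath; a minimal build.sbt sketch might look like the following (the project name and the Scala and Spark versions are assumptions and should match your installation):

name := "scala-spark-app"
version := "1.0"
scalaVersion := "2.10.4"
// the Spark version here is an assumption; use the one matching your cluster
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"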