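This section assumes that data is an RDD of (user, product, price) string tuples built earlier in the program, and that numPurchases holds the total record count referenced in the println statements later on. A minimal sketch of that setup might look as follows (the file name, master URL, and record layout here are illustrative assumptions, not taken from this excerpt):

import org.apache.spark.SparkContext

// assumed setup: a local SparkContext and an RDD of purchase records
val sc = new SparkContext("local[2]", "First Spark App")
val data = sc.textFile("data/UserPurchaseHistory.csv")
  .map(line => line.split(","))
  .map(record => (record(0), record(1), record(2)))
// total number of purchases, referenced by the println statements below
val numPurchases = data.count()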
// let's count how many unique users made purchases
val uniqueUsers = data.map{ case (user, product, price) => user }.distinct().count()
// let's sum up our total revenue
val totalRevenue = data.map{ case (user, product, price) => price.toDouble }.sum()
// let's find our most popular product
val productsByPopularity = data
  .map{ case (user, product, price) => (product, 1) }
  .reduceByKey(_ + _)
  .collect()
  .sortBy(-_._2)
val mostPopular = productsByPopularity(0)
This last piece of code, which computes the most popular product, is an example of the Map/Reduce pattern made popular by Hadoop. First, we mapped our records of (user, product, price) to records of (product, 1). Then, we performed a reduceByKey operation, where we summed up the 1s for each unique product.
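As an aside, Spark's RDD API also offers countByValue, which collapses this map-and-reduce pair into a single call that returns the per-product counts directly to the driver; a minimal equivalent sketch (the value name purchasesPerProduct is ours) would be:

// count occurrences of each product; returns a local Map[String, Long]
val purchasesPerProduct = data
  .map{ case (user, product, price) => product }
  .countByValue()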
Once we have this transformed RDD, which contains the number of purchases for each product, we call collect, which returns the results of the computation to the driver program as a local Scala collection. We then sort these counts locally (note that, in practice, if the amount of data is large, we would perform the sorting in parallel, usually with a Spark operation such as sortByKey, as sketched below).
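For example, a sketch of that parallel variant (the value name mostPopularParallel is ours) could swap each pair so that the count becomes the key, sort in a distributed fashion, and fetch only the top record:

// distributed alternative: sort by count across the cluster,
// then bring back only the single most popular (count, product) pair
val mostPopularParallel = data
  .map{ case (user, product, price) => (product, 1) }
  .reduceByKey(_ + _)
  .map{ case (product, count) => (count, product) } // make the count the key
  .sortByKey(ascending = false)
  .first()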
Finally, we will print out the results of our computations to the console:
println("Total purchases: " + numPurchases)
println("Unique users: " + uniqueUsers)
println("Total revenue: " + totalRevenue)
println("Most popular product: %s with %d
purchases".format(mostPopular._1, mostPopular._2))
}
}
We can run this program with sbt run in the project's base directory, or by running it in your Scala IDE if you are using one. Either way, the program will print the total number of purchases, the number of unique users, the total revenue, and the most popular product to the console.
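Running it with sbt requires the Spark core library on the classpath; a minimal build.sbt sketch might look like the following (the project name and the Scala and Spark versions are assumptions and should match your installation):

name := "scala-spark-app"
version := "1.0"
scalaVersion := "2.10.4"
// the Spark version here is an assumption; use the one matching your cluster
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"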