The first step to a Spark program in Scala
We will now use the ideas introduced in the previous section to write a basic Spark program that manipulates a dataset. We will start with Scala and then write the same program in Java and Python. Our program will explore some data from an online store, recording which users have purchased which products. The data is contained in a comma-separated value (CSV) file called UserPurchaseHistory.csv, and the contents are shown in the following snippet. The first column of the CSV is the username, the second column is the product name, and the final column is the price:
John,iPhone Cover,9.99
John,Headphones,5.49
Jack,iPhone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49
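Before we bring Spark into the picture, the per-record parsing that our program will apply to each line can be sketched in plain Scala. This is only an illustration, not the chapter's program: the `Purchase` case class and `parseLine` helper are our own names, introduced here to show how the three CSV fields map to typed values.

```scala
// A hypothetical sketch of the per-line parsing logic, in plain Scala.
// Purchase and parseLine are illustrative names, not part of the chapter's code.
case class Purchase(user: String, product: String, price: Double)

def parseLine(line: String): Purchase = {
  // Split "user,product,price" into its three fields
  val fields = line.split(",")
  Purchase(fields(0), fields(1), fields(2).toDouble)
}

val lines = Seq(
  "John,iPhone Cover,9.99",
  "John,Headphones,5.49",
  "Jack,iPhone Cover,9.99"
)
val purchases = lines.map(parseLine)
val total = purchases.map(_.price).sum
println(f"Total revenue: $$$total%.2f")
```

In the Spark version of the program, the same `map`-style transformations run over an RDD of lines instead of a local `Seq`.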
For our Scala program, we need to create two files: our Scala code and our project build configuration file, using the Scala Build Tool (sbt). For ease of use, we recommend that you download the sample project code called scala-spark-app for this chapter. This code also contains the CSV file under the data directory. You will need sbt installed on your system in order to run this example program (we use version 0.13.1 at the time of writing this book).
Tip
Setting up sbt is beyond the scope of this book; however, you can find more information at http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html.
Our sbt configuration file, build.sbt, looks like this (note that the empty lines between each line of code are required):

name := "scala-spark-app"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"