Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 11:42:41 INFO SparkContext: Job finished: first at
<console>:15, took 0.030533 s
res0: String = 196 242 3 881250949
Recall that this dataset consisted of the user id, movie id, rating, and timestamp fields separated by a tab ("\t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields:
val rawRatings = rawData.map(_.split("\t").take(3))
We will first split each record on the "\t" character, which gives us an Array[String]. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively.
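To see what this transformation does to a single record, here is a minimal sketch in plain Scala, using the sample line from the console output earlier (no Spark required, since split and take are ordinary Scala operations applied inside the map):

```scala
// The literal record shown in the earlier console output.
val record = "196\t242\t3\t881250949"

// Split on the tab character: Array("196", "242", "3", "881250949")
val fields: Array[String] = record.split("\t")

// take(3) keeps only the user id, movie id, and rating fields.
val firstThree: Array[String] = fields.take(3)

println(firstThree.mkString(", "))  // 196, 242, 3
```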
We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output:
14/03/30 12:24:00 INFO SparkContext: Starting job: first at
<console>:21
14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at
<console>:21) with 1 output partitions (allowLocal=true)
14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1
(first at <console>:21)
14/03/30 12:24:00 INFO DAGScheduler: Parents of final
stage: List()
14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List()
14/03/30 12:24:00 INFO DAGScheduler: Computing the
requested partition locally
14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/
Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:24:00 INFO SparkContext: Job finished: first at
<console>:21, took 0.00391 s
res6: Array[String] = Array(196, 242, 3)
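Since first() brings a single element back to the driver, we can mimic the whole pipeline on a local Scala collection, with head playing the role of first(). A small sketch, where the Seq stands in for the RDD (the first line is the real sample record from the output above; the second is a made-up record for illustration):

```scala
// A local Seq standing in for the rawData RDD.
val lines = Seq(
  "196\t242\t3\t881250949",  // real sample record from the output above
  "186\t302\t3\t891717742"   // hypothetical second record
)

// Same transformation as rawData.map(_.split("\t").take(3));
// head returns a single element, like RDD.first().
val first: Array[String] = lines.map(_.split("\t").take(3)).head

println(first.mkString("Array(", ", ", ")"))  // Array(196, 242, 3)
```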