Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 11:42:41 INFO SparkContext: Job finished: first at
<console>:15, took 0.030533 s
res0: String = 196 242 3 881250949
Recall that this dataset consists of the user id, movie id, rating, and timestamp fields, separated by a tab ("\t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields:
val rawRatings = rawData.map(_.split("\t").take(3))
We will first split each record on the "\t" character, which gives us an Array[String]. We will then use Scala's take function to keep only the first three elements of the array, which correspond to user id, movie id, and rating, respectively.
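Since split and take here operate on each line as a plain Scala String and Array, we can sanity-check the transformation on a single record in the shell. This is just an illustrative sketch; the sample line mirrors the first record we saw above:
val sampleLine = "196\t242\t3\t881250949"
val sampleFields = sampleLine.split("\t").take(3)
// sampleFields: Array[String] = Array(196, 242, 3)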
We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output:
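The call itself is a single line in the shell, mirroring the first() call on rawData that produced res0 earlier:
rawRatings.first()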
14/03/30 12:24:00 INFO SparkContext: Starting job: first at
<console>:21
14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at
<console>:21) with 1 output partitions (allowLocal=true)
14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1
(first at <console>:21)
14/03/30 12:24:00 INFO DAGScheduler: Parents of final
stage: List()
14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List()
14/03/30 12:24:00 INFO DAGScheduler: Computing the
requested partition locally
14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/
Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:24:00 INFO SparkContext: Job finished: first at
<console>:21, took 0.00391 s
res6: Array[String] = Array(196, 242, 3)