Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 11:42:41 INFO SparkContext: Job finished: first at
<console>:15, took 0.030533 s
res0: String = 196 242 3 881250949
Recall that this dataset consisted of the user id, movie id, rating, and timestamp fields separated by a tab ("\t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields:
val rawRatings = rawData.map(_.split("\t").take(3))
We will first split each record on the "\t" character, which gives us an Array[String]. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively.
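To see what this transformation does to a single record, here is a minimal sketch in plain Scala, using the sample line from the console output earlier (no Spark required, since split and take are ordinary Scala operations applied inside the map):

```scala
// The literal record shown in the earlier console output.
val record = "196\t242\t3\t881250949"

// Split on the tab character: Array("196", "242", "3", "881250949")
val fields: Array[String] = record.split("\t")

// take(3) keeps only the user id, movie id, and rating fields.
val firstThree: Array[String] = fields.take(3)

println(firstThree.mkString(", "))  // 196, 242, 3
```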
We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output:
14/03/30 12:24:00 INFO SparkContext: Starting job: first at
<console>:21
14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at
<console>:21) with 1 output partitions (allowLocal=true)
14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1
(first at <console>:21)
14/03/30 12:24:00 INFO DAGScheduler: Parents of final
stage: List()
14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List()
14/03/30 12:24:00 INFO DAGScheduler: Computing the
requested partition locally
14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/
Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:24:00 INFO SparkContext: Job finished: first at
<console>:21, took 0.00391 s
res6: Array[String] = Array(196, 242, 3)
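Since first() brings a single element back to the driver, we can mimic the whole pipeline on a local Scala collection, with head playing the role of first(). A small sketch, where the Seq stands in for the RDD (the first line is the real sample record from the output above; the second is a made-up record for illustration):

```scala
// A local Seq standing in for the rawData RDD.
val lines = Seq(
  "196\t242\t3\t881250949",  // real sample record from the output above
  "186\t302\t3\t891717742"   // hypothetical second record
)

// Same transformation as rawData.map(_.split("\t").take(3));
// head returns a single element, like RDD.first().
val first: Array[String] = lines.map(_.split("\t").take(3)).head

println(first.mkString("Array(", ", ", ")"))  // Array(196, 242, 3)
```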