Extracting features from the MovieLens 100k dataset
Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the --driver-memory option:
>./bin/spark-shell --driver-memory 4g
In this example, we will use the same MovieLens dataset that we used in the previous chapter. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code.
First, let's inspect the raw ratings dataset:
val rawData = sc.textFile("/PATH/ml-100k/u.data")
rawData.first()
You should see output similar to the following:
14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load
native-hadoop library for your platform... using
builtin-java classes where applicable
14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not
loaded
14/03/30 11:42:41 INFO FileInputFormat: Total input paths to
process : 1
14/03/30 11:42:41 INFO SparkContext: Starting job: first at
<console>:15
14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at
<console>:15) with 1 output partitions (allowLocal=true)
14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0
(first at <console>:15)
14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage:
List()
14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List()
14/03/30 11:42:41 INFO DAGScheduler: Computing the requested
partition locally
14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/
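After the logging output, first() returns the first record of the file. Each record in u.data is a tab-separated line containing the user ID, movie ID, rating, and timestamp. As a minimal sketch, assuming this standard u.data layout, you could split each record and keep only the first three fields (the timestamp is not needed for the ratings themselves):

// Split each tab-separated record and keep the user ID, movie ID, and rating fields;
// assumes the standard MovieLens 100k u.data format
val rawRatings = rawData.map(_.split("\t").take(3))
rawRatings.first()
// expected: an Array[String] such as Array(196, 242, 3)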