Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark


The MovieLens 100k dataset

The MovieLens 100k dataset is a set of 100,000 ratings given by a set of users to a set of movies. It also contains movie metadata and user profiles. Because the dataset is small, you can download it quickly and run Spark code on it, which makes it ideal for illustrative purposes.

You can download the dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip.

Once you have downloaded the data, unzip it using your terminal:

>unzip ml-100k.zip

inflating: ml-100k/allbut.pl

inflating: ml-100k/mku.sh

inflating: ml-100k/README

...

inflating: ml-100k/ub.base

inflating: ml-100k/ub.test

This will create a directory called ml-100k. Change into this directory and examine the contents. The important files are u.user (user profiles), u.item (movie metadata), and u.data (the ratings given by users to movies):

>cd ml-100k

The README file contains more information on the dataset, including the variables present

in each data file. We can use the head command to examine the contents of the various

files.

For example, we can see that the u.user file contains the user id, age, gender, occupation, and ZIP code fields, separated by a pipe character (|):

>head -5 u.user

1|24|M|technician|85711

2|53|F|other|94043

3|23|M|writer|32067

4|24|M|technician|43537

5|33|F|other|15213
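To make the record layout concrete, the following is a minimal sketch in plain Python (no Spark required) that splits the five sample rows shown above on the pipe character. The function name parse_user and the dictionary keys are illustrative choices, not names defined by the dataset; keeping the ZIP code as a string is also an assumption made here so that any leading zeros are preserved.

```python
from collections import Counter

# Sample rows copied from the `head -5 u.user` output above.
SAMPLE_LINES = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
    "5|33|F|other|15213",
]


def parse_user(line):
    """Split one pipe-delimited u.user record into a dict.

    The key names are hypothetical labels for the five fields
    described in the text (user id, age, gender, occupation, ZIP code).
    """
    user_id, age, gender, occupation, zip_code = line.strip().split("|")
    return {
        "user_id": int(user_id),
        "age": int(age),
        "gender": gender,
        "occupation": occupation,
        "zip_code": zip_code,  # kept as text (assumption: preserves leading zeros)
    }


users = [parse_user(line) for line in SAMPLE_LINES]

# Count how many of the sample users have each occupation.
occupation_counts = Counter(user["occupation"] for user in users)

print(users[0])
print(occupation_counts)
```

The same split("|") logic should carry over when the file is loaded in Spark and each line is mapped to its fields, though the exact Spark calls depend on the API version in use.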
