The MovieLens 100k dataset
The MovieLens 100k dataset is a set of 100,000 ratings given by a set of users to a set of
movies. It also contains movie metadata and user profiles. Because it is a small dataset,
you can quickly download it and run Spark code on it, which makes it ideal for
illustrative purposes.
You can download the dataset from
http://files.grouplens.org/datasets/movielens/ml-100k.zip.
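If you would rather script the download than use a browser, a minimal Python sketch using only the standard library might look like the following. The function name download_file and the default destination filename are illustrative choices, not something prescribed by the dataset or by Spark.

```python
import shutil
import urllib.request
from pathlib import Path

# URL as given in the text above.
MOVIELENS_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"


def download_file(url, dest="ml-100k.zip"):
    """Stream the resource at `url` into the local file `dest` and return its path."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so the whole archive is never held in memory
        shutil.copyfileobj(response, out)
    return dest_path


# Example call (requires network access):
# download_file(MOVIELENS_URL)
```

The call is left commented out so that nothing is fetched until you choose to run it.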
Once you have downloaded the data, unzip it using your terminal:
>unzip ml-100k.zip
inflating: ml-100k/allbut.pl
inflating: ml-100k/mku.sh
inflating: ml-100k/README
...
inflating: ml-100k/ub.base
inflating: ml-100k/ub.test
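The same extraction can be done from Python if no unzip tool is available. The following is a rough sketch using the standard zipfile module; the helper name extract_archive is hypothetical, and it assumes ml-100k.zip sits in the current working directory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of `zip_path` into `dest_dir`; return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Example call, assuming ml-100k.zip is in the current directory:
# names = extract_archive("ml-100k.zip")
```

Because the archive members are stored under the ml-100k/ prefix, as the listing above shows, extracting into the current directory yields the same layout as the terminal command.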
This will create a directory called ml-100k. Change into this directory and examine the
contents. The important files are u.user (user profiles), u.item (movie metadata), and
u.data (the ratings given by users to movies):
>cd ml-100k
The README file contains more information on the dataset, including the variables present
in each data file. We can use the head command to examine the contents of the various
files.
For example, we can see that the u.user file contains the user id, age, gender,
occupation, and ZIP code fields, separated by a pipe (|) character:
>head -5 u.user
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
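To show how these pipe-delimited records map onto fields, here is a small plain-Python sketch that parses the five lines shown above. The User type, its field names, and the parse_user helper are illustrative names chosen for this example; only the field order comes from the description of u.user.

```python
from collections import namedtuple

# Field order follows the description of u.user above; the names are illustrative.
User = namedtuple("User", ["user_id", "age", "gender", "occupation", "zip_code"])


def parse_user(line):
    """Split one pipe-delimited u.user record into a User tuple."""
    user_id, age, gender, occupation, zip_code = line.rstrip("\n").split("|")
    # The ZIP code is kept as a string so that any leading zeros are preserved
    return User(int(user_id), int(age), gender, occupation, zip_code)


# The five records printed by `head -5 u.user`
sample = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
    "5|33|F|other|15213",
]

users = [parse_user(line) for line in sample]
print(users[0])
# User(user_id=1, age=24, gender='M', occupation='technician', zip_code='85711')
```

The same split-on-pipe logic could later be applied to each line of the file when it is loaded for analysis.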