The MovieLens 100k dataset
The MovieLens 100k dataset is a set of 100,000 data points related to ratings given by a set
of users to a set of movies. It also contains movie metadata and user profiles. Because the
dataset is small, you can download it quickly and run Spark code on it, which makes it ideal
for illustrative purposes.
You can download the dataset from http://files.grouplens.org/datasets/movielens/.
Once you have downloaded the data, unzip it using your terminal:
>unzip ml-100k.zip
inflating: ml-100k/allbut.pl
inflating: ml-100k/mku.sh
inflating: ml-100k/README
...
inflating: ml-100k/ub.base
inflating: ml-100k/ub.test
This will create a directory called ml-100k. Change into this directory and examine the
contents. The important files are u.user (user profiles), u.item (movie metadata), and
u.data (the ratings given by users to movies):
>cd ml-100k
The README file contains more information on the dataset, including the variables present
in each data file. We can use the head command to examine the contents of the various
files.
For example, we can see that the u.user file contains the user id, age, gender, occupation,
and ZIP code fields, separated by a pipe (|) character:
>head -5 u.user
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
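
The pipe-separated layout shown above is simple to parse in plain Python before moving on to Spark. The following is a minimal sketch rather than a definitive loader: the names User and parse_user are hypothetical helpers introduced here for illustration, and the field order is taken from the head output above. It keeps the ZIP code as a string on the assumption that converting it to a number could lose information.

```python
from typing import NamedTuple


class User(NamedTuple):
    """One record from u.user (hypothetical container, not part of the dataset)."""
    user_id: int
    age: int
    gender: str
    occupation: str
    zip_code: str  # assumption: keep as text so the value is stored exactly as written


def parse_user(line: str) -> User:
    """Split one pipe-separated u.user line into its five fields."""
    user_id, age, gender, occupation, zip_code = line.rstrip("\n").split("|")
    return User(int(user_id), int(age), gender, occupation, zip_code)


# The first two records shown by `head -5 u.user` above.
sample = ["1|24|M|technician|85711", "2|53|F|other|94043"]
users = [parse_user(line) for line in sample]
print(users[0].occupation)
```

To parse the whole file, the same function could be applied to each line read from u.user inside the ml-100k directory.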