Exploring the user dataset
First, we will analyze the characteristics of MovieLens users. Enter the following lines into your console (where PATH is the base directory in which you ran the unzip command to extract the MovieLens 100k dataset):
user_data = sc.textFile("/PATH/ml-100k/u.user")
user_data.first()
You should see output similar to this:
u'1|24|M|technician|85711'
As we can see, this is the first line of our user data file, with its fields separated by the "|" character.
Tip
The first function is similar to collect, but it only returns the first element of the RDD to the driver. We can also use take(k) to collect only the first k elements of the RDD to the driver.
Let's transform the data by splitting each line around the "|" character. This will give us an RDD where each record is a Python list that contains the user ID, age, gender, occupation, and ZIP code fields.
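To make the shape of each record concrete, here is a minimal plain-Python sketch (no Spark required) that applies the same split to the sample line shown earlier. The unpacked variable names are illustrative only and are not defined by the dataset.

```python
# The first record of u.user, as returned by user_data.first() above.
line = "1|24|M|technician|85711"

# The same split that is applied to each line of the RDD.
fields = line.split("|")

# Unpack the five fields; these names are illustrative, not from the dataset.
user_id, age, gender, occupation, zip_code = fields

print(fields)
```

Note that every field is still a string after the split, including the user ID and age, so any numeric analysis would need an explicit conversion such as int(age).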
We will then count the number of users and the number of distinct genders, occupations, and ZIP codes. We can achieve this by running the following code in the console, line by line. Note that we do not cache the data, as caching is unnecessary for a dataset this small:
# Split each line into [user ID, age, gender, occupation, ZIP code]
user_fields = user_data.map(lambda line: line.split("|"))
# One record per user, so counting the user IDs counts the users
num_users = user_fields.map(lambda fields: fields[0]).count()
# For the remaining fields, count only the distinct values
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes))
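If it helps to see what the map, distinct, and count steps compute, the following is a rough local analogue in plain Python, using a set in place of distinct(). It is only a sketch: the helper count_distinct is hypothetical, and apart from the first record, the sample lines are made up for illustration and do not come from the MovieLens data.

```python
# Illustrative records in the u.user format (id|age|gender|occupation|zip).
# Only the first line comes from the text above; the rest are made up.
sample_lines = [
    "1|24|M|technician|85711",
    "2|31|F|writer|10001",
    "3|45|M|technician|85711",
    "4|28|F|student|60614",
]

# Local equivalent of user_data.map(lambda line: line.split("|"))
user_fields = [line.split("|") for line in sample_lines]


def count_distinct(records, index):
    """Local stand-in for rdd.map(lambda f: f[index]).distinct().count()."""
    return len({fields[index] for fields in records})


num_users = len(user_fields)
num_genders = count_distinct(user_fields, 2)
num_occupations = count_distinct(user_fields, 3)
num_zipcodes = count_distinct(user_fields, 4)

print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d"
      % (num_users, num_genders, num_occupations, num_zipcodes))
```

The set comprehension plays the role of distinct(): duplicates such as the repeated occupation and ZIP code in the third record are collapsed before counting.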