Exploring the user dataset
First, we will analyze the characteristics of MovieLens users. Enter the following lines into
your console (where PATH refers to the base directory in which you unzipped the
MovieLens 100k dataset):
user_data = sc.textFile("/PATH/ml-100k/u.user")
user_data.first()
You should see output similar to this:
u'1|24|M|technician|85711'
This is the first line of our user data file; its fields are separated by the "|" character.
Tip
The first function is similar to collect, but it only returns the first element of the
RDD to the driver. We can also use take(k) to collect only the first k elements of the
RDD to the driver.
Let's transform the data by splitting each line around the "|" character. This will give us
an RDD where each record is a Python list that contains the user ID, age, gender,
occupation, and ZIP code fields.
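To make the result of the split concrete, here is a minimal plain-Python sketch that applies the same split to the sample record shown above, outside of Spark. The variable names are our own, chosen for illustration. Note that every field comes back as a string, so a numeric field such as age needs an explicit int() conversion before any arithmetic.

```python
# The sample record from the first line of u.user, as shown earlier
line = "1|24|M|technician|85711"

# The same split that the Spark map applies to every line
fields = line.split("|")

# Unpack by position: user ID, age, gender, occupation, ZIP code
# (these names are illustrative, not part of the dataset)
user_id, age, gender, occupation, zip_code = fields

print(fields)
print(int(age) + 1)  # age is a str, so convert it before doing arithmetic
```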
We will then count the number of users and the number of distinct genders, occupations,
and ZIP codes. We can achieve this by running the following code in the console, line by
line. Note that we do not cache the data, as it is unnecessary for a dataset this small:
# Split each "|"-delimited line into a list of fields
user_fields = user_data.map(lambda line: line.split("|"))
num_users = user_fields.map(lambda fields: fields[0]).count()
# Count the distinct values of the gender, occupation, and ZIP code fields
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (
    num_users, num_genders, num_occupations, num_zipcodes))
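Because the file is small, one way to sanity-check the Spark counts is to compute the same figures in plain Python. The sketch below is only an illustration under stated assumptions: summarize_users is a hypothetical helper, and apart from the first record (taken from the output shown earlier) the sample rows are made up purely to exercise the code. A Python set plays the role that distinct() plays in Spark.

```python
def summarize_users(lines):
    """Count users and distinct genders, occupations, and ZIP codes.

    Hypothetical helper that mirrors the Spark map/distinct/count steps.
    """
    # Split each non-empty line on "|", as the Spark map does
    rows = [line.strip().split("|") for line in lines if line.strip()]
    return {
        "users": len(rows),
        "genders": len(set(r[2] for r in rows)),
        "occupations": len(set(r[3] for r in rows)),
        "zipcodes": len(set(r[4] for r in rows)),
    }


# The first record comes from the output shown earlier; the other two are made up.
sample = [
    "1|24|M|technician|85711",
    "2|30|F|technician|00000",
    "3|41|M|other|00000",
]
summary = summarize_users(sample)
print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (
    summary["users"], summary["genders"],
    summary["occupations"], summary["zipcodes"]))
```

To run it against the real file, pass an open file handle instead of the sample list, for example summarize_users(open("/PATH/ml-100k/u.user")), with PATH as described above.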