Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark


Exploring the user dataset

First, we will analyze the characteristics of MovieLens users. Enter the following lines into your console (where PATH refers to the base directory in which you unzipped the MovieLens 100k dataset):

user_data = sc.textFile("/PATH/ml-100k/u.user")

user_data.first()

You should see output similar to this:

u'1|24|M|technician|85711'

As we can see, this is the first line of our user data file, with its fields separated by the "|" character.

Tip

The first function is similar to collect, but it only returns the first element of the RDD to the driver. We can also use take(k) to collect only the first k elements of the RDD to the driver.

Let's transform the data by splitting each line around the "|" character. This will give us an RDD where each record is a Python list that contains the user ID, age, gender, occupation, and ZIP code fields, all as strings.
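To make the split step concrete, here is a small plain-Python sketch (no Spark required) that applies the same split to the sample line shown in the output above. The variable names are illustrative only and are not part of the original console session.

```python
# The first record from u.user, as shown in the output above.
line = "1|24|M|technician|85711"

# The same split that the RDD map applies to every line.
fields = line.split("|")

# Every field comes back as a string, so numeric fields need an explicit cast.
user_id, age, gender, occupation, zip_code = fields
age = int(age)

print(fields)  # ['1', '24', 'M', 'technician', '85711']
```

Inside Spark, the same function body runs once per line of the RDD, so each record ends up as a list like the one printed here.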

We will then count the number of users and the number of distinct genders, occupations, and ZIP codes. We can achieve this by running the following code in the console, line by line. Note that we do not cache the data, as caching is unnecessary for a dataset this small:

# Split each line into its five fields
user_fields = user_data.map(lambda line: line.split("|"))

# One record per user, so counting the IDs gives the number of users
num_users = user_fields.map(lambda fields: fields[0]).count()

# For the remaining fields, count the distinct values only
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()

num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()

num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()

print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes))
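The counting logic above can be mirrored in plain Python as a rough sketch of what map, distinct, and count compute. The records below are illustrative lines in the u.user format; only the first is taken from the output shown earlier, and the rest are assumed for the sake of the example.

```python
# Illustrative records in the u.user format; only the first one is taken
# from the output shown earlier, the rest are assumed for this sketch.
lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
]

# Equivalent of user_data.map(lambda line: line.split("|"))
rows = [line.split("|") for line in lines]

# count() on the mapped RDD is simply the number of records.
num_users = len(rows)

# map(...).distinct().count() corresponds to the size of a set of values.
num_genders = len({fields[2] for fields in rows})
num_occupations = len({fields[3] for fields in rows})
num_zipcodes = len({fields[4] for fields in rows})

print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d"
      % (num_users, num_genders, num_occupations, num_zipcodes))
# Users: 4, genders: 2, occupations: 3, ZIP codes: 4
```

On the full dataset the Spark version computes the same quantities, with the difference that the work is spread across the partitions of the RDD.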
