Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

splitting the ratings with the tab character. We will now use the rating_data variable

again in the following code.

To compute the distribution of ratings per user, we will first extract the user ID as the key and the rating as the value from the rating_data RDD. We will then group the ratings by user ID using Spark's groupByKey function:

# Build (user ID, rating) pairs, then collect all ratings for each user ID
user_ratings_grouped = rating_data \
    .map(lambda fields: (int(fields[0]), int(fields[2]))) \
    .groupByKey()
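To make the grouping step concrete, here is a minimal plain-Python sketch of what groupByKey does to (user ID, rating) pairs. It runs locally without Spark; the sample pairs and the helper name group_by_key are made up for illustration and are not part of the dataset or of Spark's API.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key, mimicking the result of groupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Hypothetical (user ID, rating) pairs, not taken from the real dataset.
sample_pairs = [(1, 5), (2, 3), (1, 4), (3, 2), (2, 5), (1, 1)]

grouped = group_by_key(sample_pairs)
print(grouped)  # {1: [5, 4, 1], 2: [3, 5], 3: [2]}
```

In Spark itself, the values for each key come back as an iterable rather than a list, and their order within a group is not guaranteed.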

Next, for each key (user ID), we will take the size of its group of ratings; this will give us the number of ratings for that user:

# Index into each (key, values) pair; Python 3 lambdas cannot unpack tuple arguments
user_ratings_byuser = user_ratings_grouped.map(lambda kv: (kv[0], len(kv[1])))
user_ratings_byuser.take(5)

We can inspect the resulting RDD by taking a few records from it; this should give us a list of (user ID, number of ratings) pairs:

[(1, 272), (2, 62), (3, 54), (4, 24), (5, 175)]
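Since only the number of ratings per user is needed, the groups themselves never have to be built. The following sketch counts the occurrences of each key directly in plain Python; on an RDD of pairs, Spark's countByKey, or a map to (user ID, 1) followed by reduceByKey, plays a similar role. The sample pairs are hypothetical and serve only to show the idea.

```python
from collections import Counter

# Hypothetical (user ID, rating) pairs, not taken from the real dataset.
sample_pairs = [(1, 5), (2, 3), (1, 4), (3, 2), (2, 5), (1, 1)]

# Count how many ratings each user ID contributes, ignoring the rating values.
ratings_per_user = Counter(user_id for user_id, _ in sample_pairs)

# Sort by user ID to get (user ID, number of ratings) pairs.
counts = sorted(ratings_per_user.items())
print(counts)  # [(1, 3), (2, 2), (3, 1)]
```

Counting this way avoids holding every rating for a user in memory at once, which matters when some users have very many ratings.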

Finally, we will plot the histogram of the number of ratings per user using our favorite hist function:

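import matplotlib.pyplot as plt

# Bring only the per-user counts back to the driver as a local list
user_ratings_byuser_local = user_ratings_byuser.map(lambda kv: kv[1]).collect()
# density=True replaces normed=True, which newer matplotlib releases no longer accept
plt.hist(user_ratings_byuser_local, bins=200, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(16, 10)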

Your chart should look similar to the following screenshot. We can see that most of the users give fewer than 100 ratings. The distribution of the ratings shows, however, that there is a fairly large number of users that provide hundreds of ratings.
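The reading of the chart can also be checked numerically. The sketch below computes the share of users under 100 ratings, along with the median and maximum, using only the five (user ID, number of ratings) pairs shown earlier. It is an illustration of the calculation, not a result for the full dataset; with the real data, the collected list of per-user counts would take the place of the sample.

```python
from statistics import median

# The five (user ID, number of ratings) pairs shown in the text;
# a real check would use the full collected list instead.
sample_counts = [(1, 272), (2, 62), (3, 54), (4, 24), (5, 175)]
num_ratings = [n for _, n in sample_counts]

threshold = 100  # cut-off used in the text
share_below = sum(1 for n in num_ratings if n < threshold) / len(num_ratings)

print(share_below)          # 0.6
print(median(num_ratings))  # 62
print(max(num_ratings))     # 272
```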
