splitting the ratings with the tab character. We will now use the rating_data variable
again in the following code.
To compute the distribution of ratings per user, we will first extract the user ID as the key
and the rating as the value from the rating_data RDD. We will then group the ratings by
user ID using Spark's groupByKey function:
# Key each record by user ID (field 0), with its rating (field 2) as the value
user_ratings_grouped = rating_data.map(
    lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
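To make the grouping step concrete, the following is a minimal plain-Python sketch of what grouping by key does to the (user ID, rating) pairs, without Spark. The records in sample_records and the helper group_by_key are hypothetical and made up for illustration; only the positions of the user ID (field 0) and the rating (field 2) follow the code above.

```python
from collections import defaultdict

# Hypothetical records, already split on the tab character.
# Field 0 is the user ID and field 2 is the rating; fields 1 and 3
# are placeholders. None of these values come from the real data set.
sample_records = [
    ["1", "10", "5", "0"],
    ["2", "10", "3", "0"],
    ["1", "20", "4", "0"],
    ["1", "30", "2", "0"],
]


def group_by_key(pairs):
    """Collect all values that share a key, like a local groupByKey."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Same extraction as the map step: (user ID, rating) as integers.
pairs = [(int(f[0]), int(f[2])) for f in sample_records]
grouped = group_by_key(pairs)
print(grouped)
```

Each key ends up with the full list of its ratings, which is why the next step only has to measure the length of each group.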
Next, for each key (user ID), we will find the size of the set of ratings; this will give us the
number of ratings for that user:
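# Tuple-unpacking lambdas such as lambda (k, v) work only in Python 2,
# so we index into the (key, values) pair instead
user_ratings_byuser = user_ratings_grouped.map(
    lambda kv: (kv[0], len(kv[1])))
user_ratings_byuser.take(5)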
Taking a few records from the resulting RDD, as in the last line above, lets us inspect it;
this should return a list of (user ID, number of ratings) pairs:
[(1, 272), (2, 62), (3, 54), (4, 24), (5, 175)]
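Because only the number of ratings per user is needed, the same kind of result can be produced by counting directly rather than collecting every rating into a group first. The following is a small plain-Python sketch of that idea using collections.Counter. The list sample_user_ids is hypothetical, and this is an illustration of the counting logic rather than the approach used in the text.

```python
from collections import Counter

# Hypothetical user IDs, one per rating record (not from the real data set).
sample_user_ids = [1, 2, 1, 3, 1, 2]

# Count how many ratings each user contributed.
ratings_per_user = Counter(sample_user_ids)

# Sort by user ID to get (user ID, number of ratings) pairs.
count_pairs = sorted(ratings_per_user.items())
print(count_pairs)
```

In Spark, a comparable pattern would be to map each record to (user ID, 1) and sum the ones per key with reduceByKey, which avoids moving every individual rating across the cluster.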
Finally, we will plot the histogram of the number of ratings per user using our favorite hist
function:
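# Keep only the rating counts and bring them back to the driver
user_ratings_byuser_local = user_ratings_byuser.map(
    lambda kv: kv[1]).collect()
# density=True replaces normed=True, which was removed in Matplotlib 3.1
hist(user_ratings_byuser_local, bins=200,
     color='lightblue', density=True)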
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)
Your chart should look similar to the following screenshot. We can see that most users
give fewer than 100 ratings. The distribution shows, however, that a fairly large number
of users provide hundreds of ratings.
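The same reading can be checked numerically by bucketing the counts into ranges of 100. The sketch below does this for the five (user ID, number of ratings) pairs shown in the earlier output only; five users are far too few to describe the full distribution, so treat it as an illustration of the calculation. The names sample, buckets, and share_under_100 are made up for this example.

```python
from collections import Counter

# The five (user ID, number of ratings) pairs from the take(5) output.
sample = [(1, 272), (2, 62), (3, 54), (4, 24), (5, 175)]

counts = [n for _, n in sample]

# Bucket each count into a range of width 100: 0-99, 100-199, 200-299, ...
buckets = Counter((n // 100) * 100 for n in counts)

# Share of these users with fewer than 100 ratings.
share_under_100 = sum(1 for n in counts if n < 100) / len(counts)

for start in sorted(buckets):
    print(f"{start}-{start + 99}: {buckets[start]} user(s)")
print(share_under_100)
```

Applied to the full list collected in user_ratings_byuser_local, the same two expressions would give the share of users below 100 ratings and the size of the heavier-rating tail.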