After running these lines in your console, you will see output similar to the following:
Min rating: 1
Max rating: 5
Average rating: 3.53
Median rating: 4
Average # of ratings per user: 106.00
Average # of ratings per movie: 59.00
We can see that the minimum rating is 1, while the maximum rating is 5. This is in line
with what we expect, since the ratings are on a scale of 1 to 5.
Spark also provides a stats function for RDDs that contain a numeric variable (such as
ratings in this case); it computes similar summary statistics:
ratings.stats()
Here is the output:
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
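To make the fields of this summary concrete, here is a minimal plain-Python sketch (no Spark required) that computes the same five quantities for a small, made-up list of ratings. The function name summarize and the sample values are hypothetical and exist only for illustration. The sketch uses the population standard deviation (dividing by n), which is what the stdev field of Spark's summary reports.

```python
import math

def summarize(values):
    """Return count, mean, population stdev, max and min of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # population variance: divide by n rather than n - 1
    variance = sum((v - mean) ** 2 for v in values) / n
    return {
        "count": n,
        "mean": mean,
        "stdev": math.sqrt(variance),
        "max": max(values),
        "min": min(values),
    }

# hypothetical sample, not taken from the MovieLens data
sample_ratings = [1.0, 3.0, 4.0, 4.0, 5.0]
summary = summarize(sample_ratings)
print(summary)
```

On the real dataset, ratings.stats() does this work across the cluster, so the ratings never have to be collected on the driver.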
Looking at the results, the average rating given by a user to a movie is around 3.5 and the
median rating is 4, so we might expect the distribution of ratings to be skewed slightly
towards higher ratings. Let's see whether this is true by creating a bar chart of rating
values, using a procedure similar to the one we used for occupations:
count_by_rating = ratings.countByValue()
# wrap keys() in list() so that NumPy builds a 1-D array on Python 3 as well
x_axis = np.array(list(count_by_rating.keys()))
y_axis = np.array([float(c) for c in count_by_rating.values()])
# we normalize the counts here so that the y-axis shows fractions of the total
y_axis_normed = y_axis / y_axis.sum()
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)