Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

After running these lines on your console, you will see output similar to the following result:

Min rating: 1

Max rating: 5

Average rating: 3.53

Median rating: 4

Average # of ratings per user: 106.00

Average # of ratings per movie: 59.00

We can see that the minimum rating is 1, while the maximum rating is 5. This is in line

with what we expect, since the ratings are on a scale of 1 to 5.

Spark also provides a stats function for RDDs that contain a numeric variable (such as ratings in this case); it computes similar summary statistics:

ratings.stats()

Here is the output:

(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
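As a rough local cross-check of the quantities that stats reports, the same summary can be computed with Python's standard library. The sketch below uses a small made-up list of ratings (sample_ratings is hypothetical, not the MovieLens data) and assumes that stdev refers to the population standard deviation, which is how PySpark's StatCounter defines it:

```python
import statistics

# Hypothetical sample of ratings, used only for illustration
sample_ratings = [1, 3, 4, 4, 5, 5, 3, 4]

summary = {
    "count": len(sample_ratings),
    "mean": statistics.mean(sample_ratings),
    # population standard deviation (divides by n, not n - 1)
    "stdev": statistics.pstdev(sample_ratings),
    "max": max(sample_ratings),
    "min": min(sample_ratings),
}

# The median is not part of the stats summary, so compute it separately
median_rating = statistics.median(sample_ratings)

print(summary)
print("median:", median_rating)
```

On a real RDD the median has to be computed separately in the same way, since the stats output shown above does not include it.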

Looking at the results, the average rating given by a user to a movie is around 3.5 and the median rating is 4, so we might expect that the distribution of ratings will be skewed towards slightly higher ratings. Let's see whether this is true by creating a bar chart of rating values, using a procedure similar to the one we used for occupations:

count_by_rating = ratings.countByValue()

x_axis = np.array(list(count_by_rating.keys()))

y_axis = np.array([float(c) for c in count_by_rating.values()])

# we normalize the y-axis here so that the values are fractions of the total

y_axis_normed = y_axis / y_axis.sum()

pos = np.arange(len(x_axis))

width = 1.0

ax = plt.axes()

ax.set_xticks(pos + (width / 2))

ax.set_xticklabels(x_axis)
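The counting and normalization steps can be tried without a Spark cluster. In the sketch below, collections.Counter stands in for ratings.countByValue(), and sample_ratings is a hypothetical list rather than the real dataset. Because the keys of the returned mapping need not come back in sorted order, this version sorts the rating values explicitly so that the bars would run from the lowest to the highest rating:

```python
from collections import Counter

# Hypothetical ratings; Counter plays the role of ratings.countByValue()
sample_ratings = [1, 3, 4, 4, 5, 5, 3, 4]
count_by_rating = Counter(sample_ratings)

# Sort the rating values so the bars appear in ascending order
x_axis = sorted(count_by_rating.keys())
y_axis = [float(count_by_rating[r]) for r in x_axis]

# Normalize the counts to fractions of the total (they sum to 1)
total = sum(y_axis)
y_axis_normed = [c / total for c in y_axis]

for rating, share in zip(x_axis, y_axis_normed):
    print(rating, round(share, 3))
```

The x_axis and y_axis_normed lists produced here can be used as the tick labels and bar heights when drawing the chart.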
