Exploring the rating dataset
Let's now take a look at the ratings data:
rating_data_raw = sc.textFile("/PATH/ml-100k/u.data")
print rating_data_raw.first()
num_ratings = rating_data_raw.count()
print "Ratings: %d" % num_ratings
This gives us the following result:
196 242 3 881250949
Ratings: 100000
There are 100,000 ratings and, unlike the user and movie datasets, these records are split with a tab character ("\t"). As you might have guessed, we'll want to compute some basic summary statistics and a frequency histogram of the rating values. Let's do this now:
import numpy as np

# Split each record on the tab delimiter and extract the rating value (third field)
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))

# Compute basic summary statistics for the ratings
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / float(num_users)
ratings_per_movie = num_ratings / float(num_movies)
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie
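The frequency histogram mentioned earlier is not produced by the preceding snippet. As a minimal sketch, one way to get the per-rating counts is with the RDD's countByValue method, applied to the ratings RDD built above; plotting the resulting counts (for example, with matplotlib) is left out here:

# Count how many times each rating value occurs across all 100,000 ratings
count_by_rating = ratings.countByValue()
for rating_value in sorted(count_by_rating.keys()):
    print "Rating %d: %d" % (rating_value, count_by_rating[rating_value])

Note that Spark can also compute several of the summary statistics above in a single pass over the data with ratings.stats(), which returns the count, mean, standard deviation, maximum, and minimum.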