Exploring the rating dataset
Let's now take a look at the ratings data:
rating_data_raw = sc.textFile("/PATH/ml-100k/u.data")
print rating_data_raw.first()
num_ratings = rating_data_raw.count()
print "Ratings: %d" % num_ratings
This gives us the following result:
196 242 3 881250949
Ratings: 100000
There are 100,000 ratings and, unlike the user and movie datasets, these records are split with a tab character ("\t"). As you might have guessed, we'll want to compute some basic summary statistics and a frequency histogram of the rating values. Let's do this now:
import numpy as np

# Split each record on the tab delimiter and extract the rating value (third field)
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))

# Compute basic summary statistics for the ratings
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / float(num_users)
ratings_per_movie = num_ratings / float(num_movies)
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie
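The frequency histogram mentioned earlier is not produced by the preceding snippet. As a minimal sketch, one way to get the per-rating counts is with the RDD's countByValue method, applied to the ratings RDD built above; plotting the resulting counts (for example, with matplotlib) is left out here:

# Count how many times each rating value occurs across all 100,000 ratings
count_by_rating = ratings.countByValue()
for rating_value in sorted(count_by_rating.keys()):
    print "Rating %d: %d" % (rating_value, count_by_rating[rating_value])

Note that Spark can also compute several of the summary statistics above in a single pass over the data with ratings.stats(), which returns the count, mean, standard deviation, maximum, and minimum.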