Exploring the movie dataset
Next, we will investigate a few properties of the movie catalogue. We can inspect a row of
the movie data file, as we did for the user data earlier, and then count the number of
movies:
# sc is the existing SparkContext; replace /PATH with the directory that holds ml-100k
movie_data = sc.textFile("/PATH/ml-100k/u.item")
print(movie_data.first())
num_movies = movie_data.count()
print("Movies: %d" % num_movies)
You will see the following output on your console:
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Movies: 1682
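To make the structure of this record easier to see, here is a minimal sketch that splits it on the pipe delimiter in plain Python, without Spark. It assumes the usual MovieLens 100k layout (movie ID, title, release date, video release date, IMDb URL, followed by genre flags); the variable names are illustrative and do not come from the original listing.

```python
# The first record of u.item, exactly as printed above.
record = ("1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/"
          "title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0")

# Fields are pipe-delimited.
fields = record.split("|")

movie_id = int(fields[0])
title = fields[1]
release_date = fields[2]        # the field used later to derive the release year
video_release_date = fields[3]  # empty for this record
genre_flags = [int(f) for f in fields[5:]]  # assumed: 0/1 flag per genre

print("id=%d title=%s released=%s" % (movie_id, title, release_date))
print("genres set: %d" % sum(genre_flags))
```

The release date sits in the third field (index 2), which is the value the year-parsing step works on.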
In the same manner as we did for user ages and occupations earlier, we can plot the
distribution of movie age, that is, the year of release relative to the current date (note that
for this dataset, the current year is 1998).
The following code block defines a small function called convert_year to handle errors
when parsing the release date field. This is needed because one line of the movie data
has a blank release date:
def convert_year(x):
    try:
        # the year is the last four characters of the release date
        return int(x[-4:])
    except ValueError:
        # there is a 'bad' data point with a blank year,
        # which we set to 1900 and will filter out later
        return 1900
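As a quick check of how this function behaves, the sketch below calls it on the release date from the record shown earlier and on an empty string. The function is repeated so that the snippet runs on its own, and it assumes a blank date reaches the function as an empty string.

```python
def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        return 1900  # sentinel for an unparseable (blank) release date

good = convert_year("01-Jan-1995")  # last four characters are "1995"
bad = convert_year("")              # int("") raises ValueError, so the sentinel is returned

print(good)
print(bad)
```

Using a sentinel value rather than raising keeps the transformation from failing on a single bad row; the sentinel then has to be filtered out before any statistics are computed.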
Once we have our utility function to parse the year of release, we can apply it to the movie
data using a map transformation and collect the results:
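A minimal sketch of this step follows, written in plain Python so that it runs without a Spark cluster. The sample lines are stand-ins for movie_data: the first is the leading part of the record shown earlier, and the other two are made up for illustration (one has a blank release date). The comments name the RDD operations (map, filter, countByValue) that would play the same role in Spark; this is one plausible arrangement, not necessarily the original listing.

```python
from collections import Counter

def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        return 1900  # sentinel for a blank release date

# Stand-in for movie_data (hypothetical, truncated records).
sample_lines = [
    "1|Toy Story (1995)|01-Jan-1995",        # first three fields of the record shown earlier
    "9001|Hypothetical Film A|15-Mar-1990",  # made-up record
    "9002|Hypothetical Film B|",             # made-up record with a blank release date
]

# Spark equivalent: movie_data.map(lambda line: line.split("|"))
movie_fields = [line.split("|") for line in sample_lines]

# Spark equivalent: .map(lambda fields: fields[2]).map(convert_year)
years = [convert_year(fields[2]) for fields in movie_fields]

# Spark equivalent: .filter(lambda year: year != 1900)
years_filtered = [year for year in years if year != 1900]

# Spark equivalent: .map(lambda year: 1998 - year).countByValue()
# The current year for this dataset is 1998.
movie_ages = Counter(1998 - year for year in years_filtered)

print(sorted(movie_ages.items()))
```

The resulting mapping from age to count is the input for a histogram of movie ages.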