Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Exploring the movie dataset

Next, we will investigate a few properties of the movie catalogue. We can inspect a row of the movie data file, as we did for the user data earlier, and then count the number of movies:

# sc is the SparkContext; replace /PATH with the directory that holds ml-100k
movie_data = sc.textFile("/PATH/ml-100k/u.item")
print(movie_data.first())
num_movies = movie_data.count()
print("Movies: %d" % num_movies)

You will see the following output on your console:

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

Movies: 1682
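Each record is a single pipe-delimited line. As a minimal sketch, the first record can be split into its fields in plain Python, without Spark. The field meanings in the comments follow the MovieLens 100k README, and the variable names are only illustrative:

```python
# First record of u.item, as printed above (a single line in the file)
first_line = (
    "1|Toy Story (1995)|01-Jan-1995||"
    "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"
    "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
)

fields = first_line.split("|")

movie_id = int(fields[0])       # numeric movie ID
title = fields[1]               # title with the release year in parentheses
release_date = fields[2]        # for example "01-Jan-1995"
video_release = fields[3]       # blank for this record
imdb_url = fields[4]
genre_flags = [int(f) for f in fields[5:]]  # one 0/1 flag per genre

print("%d | %s | %s" % (movie_id, title, release_date))
print("Genres set: %d" % sum(genre_flags))
```

The release date sits at index 2, which is the field the year-parsing step works on.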

In the same manner as we did for user ages and occupations earlier, we can plot the distribution of movie age, that is, the year of release relative to the current date (note that for this dataset, the current year is 1998).

In the following code block, we can see that we need a small function called convert_year to handle errors in the parsing of the release date field. This is due to some bad data in one line of the movie data:

def convert_year(x):
    try:
        # the year is the last four characters of the release date
        return int(x[-4:])
    except ValueError:
        # there is a 'bad' data point with a blank year,
        # which we set to 1900 and will filter out later
        return 1900
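The fallback is easy to check locally without Spark. The sketch below restates the function, catching `ValueError` (which is what `int` raises on an empty string), and calls it on one well-formed date and one blank field. Both sample inputs are illustrative:

```python
def convert_year(x):
    try:
        return int(x[-4:])   # last four characters hold the year
    except ValueError:       # int("") fails when the field is blank
        return 1900          # sentinel value, filtered out later

print(convert_year("01-Jan-1995"))  # 1995
print(convert_year(""))             # 1900
```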

Once we have our utility function to parse the year of release, we can apply it to the movie data using a map transformation and collect the results:
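A rough local sketch of this step, under stated assumptions: it uses a plain Python list in place of the RDD so that it runs without a cluster, and the sample rows are shortened to their first three fields (the second and third rows are hypothetical). On an RDD, the same functions would be passed to `map` and `filter`, with `collect` bringing the results back to the driver.

```python
def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        return 1900  # sentinel for a blank release date

# Shortened sample rows: id|title|release date (rows 2 and 3 are hypothetical)
rows = [
    "1|Toy Story (1995)|01-Jan-1995",
    "2|Example Movie (1996)|15-Mar-1996",
    "3|Movie With No Date|",  # blank release date
]

movie_fields = [row.split("|") for row in rows]       # plays the role of map
years = [convert_year(f[2]) for f in movie_fields]    # plays the role of map
years_filtered = [y for y in years if y != 1900]      # plays the role of filter
movie_ages = [1998 - y for y in years_filtered]       # age relative to 1998

print(years)
print(movie_ages)
```

Filtering on the sentinel value 1900 removes the record with the blank date before the ages are computed, so it does not show up as a 98-year-old movie.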
