movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
Since we have assigned the value 1900 to any entry that fails to parse, we can filter these bad values out of the resulting data using Spark's filter transformation:
years_filtered = years.filter(lambda x: x != 1900)
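The convert_year helper used above is assumed here rather than defined; a minimal plain-Python sketch of the parse-and-filter logic (the helper's name and the sample date format are assumptions, and plain lists stand in for the Spark RDDs) might look like:

```python
def convert_year(x):
    """Extract a four-digit year from a date string such as '01-Jan-1995'.

    Falls back to the sentinel value 1900 when parsing fails, so that
    bad records can be filtered out downstream.
    """
    try:
        return int(x[-4:])
    except (ValueError, TypeError):
        return 1900  # sentinel for unparseable data

# Hypothetical sample rows, including one empty (bad) entry
raw_dates = ["01-Jan-1995", "01-Jan-1994", ""]

# Local equivalents of the map and filter transformations
years = [convert_year(d) for d in raw_dates]
years_filtered = [y for y in years if y != 1900]
```

The sentinel-then-filter pattern keeps the mapping step total (it never raises), which is important in Spark since an uncaught exception in a lambda would fail the whole task.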
This is a good example of how real-world datasets can often be messy and require a more
in-depth approach to parsing data. In fact, this also illustrates why data exploration is so
important, as many of these issues in data integrity and quality are picked up during this
phase.
After filtering out bad data, we will transform the list of movie release years into movie ages by subtracting each release year from 1998 (the year of the dataset), use countByValue to compute the counts for each movie age, and finally plot our histogram of movie ages (again using the hist function, where the values variable holds the values of the result from countByValue, and the bins variable holds the keys):
movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
# note: newer versions of Matplotlib replace normed=True with density=True
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
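Spark's countByValue returns a dict-like mapping from each distinct value to its count. The same age computation can be sketched locally with collections.Counter (the sample years below are hypothetical stand-ins for the RDD contents):

```python
from collections import Counter

# Hypothetical filtered release years standing in for years_filtered
years_filtered = [1995, 1995, 1994, 1990, 1996]

# Local equivalent of map(lambda yr: 1998 - yr).countByValue()
movie_ages = Counter(1998 - yr for yr in years_filtered)

values = list(movie_ages.values())  # counts per age
bins = list(movie_ages.keys())      # the distinct ages
```

Because countByValue collects its result to the driver as an ordinary dictionary, it should only be used when the number of distinct values (here, a few decades of ages) is small.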
You will see an image similar to the one here; it illustrates that most of the movies were
released in the last few years before 1998: