# Split each line of the raw movie data on the "|" delimiter
movie_fields = movie_data.map(lambda lines: lines.split("|"))
# fields[2] holds the release date string; convert_year extracts the year from it
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
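Here, convert_year is the helper used to parse the year out of the release date string; a minimal sketch of such a function, assuming the date strings end with a four-digit year and using 1900 as the error value described next, could look like this:
def convert_year(x):
    try:
        # Release dates such as "01-Jan-1995" end with the four-digit year
        return int(x[-4:])
    except Exception:
        # Use 1900 as a sentinel value for any date that cannot be parsed
        return 1900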
Since we have assigned the value 1900 to any year that could not be parsed, we can filter these bad
values out of the resulting data using Spark's filter transformation:
years_filtered = years.filter(lambda x: x != 1900)
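If you are curious how many records were affected, a quick illustrative check (assuming the years and years_filtered RDDs above) is to compare the counts before and after filtering:
# Number of records with unparseable years that were filtered out
num_bad_years = years.count() - years_filtered.count()
print(num_bad_years)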
This is a good example of how real-world datasets can often be messy and require a more
in-depth approach to parsing data. In fact, this also illustrates why data exploration is so
important, as many of these issues in data integrity and quality are picked up during this
phase.
After filtering out bad data, we will transform the list of movie release years into movie
ages by subtracting each release year from 1998, use countByValue to compute the counts
for each movie age, and finally plot our histogram of movie ages (again using the hist
function, where the values variable holds the values of the result from countByValue and
the bins variable holds the keys):
# Compute each movie's age relative to 1998 and count the number of movies per age
movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
values = list(movie_ages.values())
bins = list(movie_ages.keys())
# normed normalizes the histogram (newer matplotlib versions use density=True instead)
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
The resulting histogram illustrates that most of the movies were released in the last few
years before 1998.
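Since movie_ages is simply a dictionary mapping each age to its count, the same distribution can also be drawn as a bar chart with an explicit matplotlib.pyplot import; a minimal sketch, assuming the movie_ages result computed above:
import matplotlib.pyplot as plt

# Sort the (age, count) pairs so the bars appear in age order
ages = sorted(movie_ages.keys())
counts = [movie_ages[age] for age in ages]

plt.bar(ages, counts, color='lightblue')
plt.xlabel('Movie age (years before 1998)')
plt.ylabel('Number of movies')
plt.show()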