movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
Since we have assigned the value 1900 to any entry that fails to parse, we can filter these bad values out of the resulting data using Spark's filter transformation:
years_filtered = years.filter(lambda x: x != 1900)
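The convert_year helper used above is assumed here rather than defined; a minimal plain-Python sketch of the parse-and-filter logic (the helper's name and the sample date format are assumptions, and plain lists stand in for the Spark RDDs) might look like:

```python
def convert_year(x):
    """Extract a four-digit year from a date string such as '01-Jan-1995'.

    Falls back to the sentinel value 1900 when parsing fails, so that
    bad records can be filtered out downstream.
    """
    try:
        return int(x[-4:])
    except (ValueError, TypeError):
        return 1900  # sentinel for unparseable data

# Hypothetical sample rows, including one empty (bad) entry
raw_dates = ["01-Jan-1995", "01-Jan-1994", ""]

# Local equivalents of the map and filter transformations
years = [convert_year(d) for d in raw_dates]
years_filtered = [y for y in years if y != 1900]
```

The sentinel-then-filter pattern keeps the mapping step total (it never raises), which is important in Spark since an uncaught exception in a lambda would fail the whole task.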
This is a good example of how real-world datasets can often be messy and require a more
in-depth approach to parsing data. In fact, this also illustrates why data exploration is so
important, as many of these issues in data integrity and quality are picked up during this
phase.
After filtering out bad data, we will transform the list of movie release years into movie ages by subtracting each release year from 1998 (the year of the dataset), use countByValue to compute the counts for each movie age, and finally plot our histogram of movie ages (again using the hist function, where the values variable holds the values of the result from countByValue, and the bins variable holds the keys):
movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
# note: newer versions of Matplotlib replace normed=True with density=True
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
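Spark's countByValue returns a dict-like mapping from each distinct value to its count. The same age computation can be sketched locally with collections.Counter (the sample years below are hypothetical stand-ins for the RDD contents):

```python
from collections import Counter

# Hypothetical filtered release years standing in for years_filtered
years_filtered = [1995, 1995, 1994, 1990, 1996]

# Local equivalent of map(lambda yr: 1998 - yr).countByValue()
movie_ages = Counter(1998 - yr for yr in years_filtered)

values = list(movie_ages.values())  # counts per age
bins = list(movie_ages.keys())      # the distinct ages
```

Because countByValue collects its result to the driver as an ordinary dictionary, it should only be used when the number of distinct values (here, a few decades of ages) is small.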
You will see an image similar to the one here; it illustrates that most of the movies were
released in the last few years before 1998: