Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Filling in bad or missing data

We have already seen an example of filtering out bad data. Following on from the preceding code, the following snippets apply the fill-in approach to the bad release date record by assigning it a value equal to the median year of release. We start by collecting the converted release years into a numpy array:

years_pre_processed = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x)).collect()
years_pre_processed_array = np.array(years_pre_processed)
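For readers running this snippet in isolation, the following is a minimal sketch of what a helper such as convert_year might look like. It assumes that each release date is a string ending in a four-digit year (for example, 01-Jan-1995) and that anything unparseable is mapped to the sentinel value 1900; the helper defined earlier may differ in its details.

```python
def convert_year(date_string):
    """Return the trailing four-digit year as an int, or 1900 if parsing fails.

    Assumption: dates look like '01-Jan-1995'; 1900 is the sentinel for bad data.
    """
    try:
        return int(date_string[-4:])
    except (ValueError, TypeError):
        # Empty, malformed, or missing dates fall back to the sentinel year.
        return 1900


print(convert_year("01-Jan-1995"))
print(convert_year(""))
```

Using a sentinel year that cannot occur in the real data makes the bad record easy to locate later with a simple equality test.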

First, we compute the mean and median year of release over all of the release-year data except the bad data point. We then use the numpy function where to find the index of the bad value in years_pre_processed_array (recall that we assigned the value 1900 to this data point). Finally, we use this index to assign the median release year to the bad value:

mean_year = np.mean(years_pre_processed_array[years_pre_processed_array != 1900])
median_year = np.median(years_pre_processed_array[years_pre_processed_array != 1900])
index_bad_data = np.where(years_pre_processed_array == 1900)[0][0]
years_pre_processed_array[index_bad_data] = median_year
print("Mean year of release: %d" % mean_year)
print("Median year of release: %d" % median_year)
print("Index of '1900' after assigning median: %s" % np.where(years_pre_processed_array == 1900)[0])

You should expect to see the following output:

Mean year of release: 1989
Median year of release: 1995
Index of '1900' after assigning median: []
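The index-based assignment above fills a single bad value. If more than one record carried the sentinel, a boolean mask could fill all of them in one step. The following is a minimal sketch on made-up data, not the MovieLens years, and it assumes 1900 is the only sentinel in use:

```python
import numpy as np

# Made-up release years for illustration; 1900 is the sentinel for bad records.
years = np.array([1995, 1996, 1900, 1994, 1997, 1900, 1960])

bad_mask = years == 1900                        # True wherever the sentinel appears
median_year = int(np.median(years[~bad_mask]))  # median of the valid years only

filled = years.copy()                           # leave the original array untouched
filled[bad_mask] = median_year                  # fill every bad record at once

print(filled)
print(np.where(filled == 1900)[0])              # no sentinel values should remain
```

Computing the median from the valid years only matters here: including the sentinel values would pull the fill-in value toward 1900.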

We computed both the mean and the median year of release here. As can be seen from the output, the median release year is noticeably higher than the mean because the distribution of years is skewed. While it is not always straightforward to decide precisely which fill-in value to use in a given situation, in this case the median is a reasonable choice because of this skew.
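To see why skew separates the two statistics, consider a small made-up sample in which most releases are recent and a few are much older. This is only an illustration under that assumption; the values are not taken from the dataset.

```python
import numpy as np

# Made-up sample: most years cluster in the mid-1990s, with a few much older films.
years = np.array([1930, 1950, 1990, 1993, 1994, 1995, 1995, 1996, 1996, 1997])

mean_year = np.mean(years)      # pulled down by the few old releases
median_year = np.median(years)  # stays close to where most of the data lies

print("Mean: %.1f" % mean_year)
print("Median: %.1f" % median_year)
```

The handful of old films drags the mean well below the median, so a mean-based fill-in would assign a year that is earlier than most of the films in the sample.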
