Filling in bad or missing data
We have already seen an example of filtering out bad data. Continuing from the
preceding code, we now apply the fill-in approach to the record with the bad
release date by assigning it a value equal to the median year of release. We
start by collecting the converted years into a numpy array:
years_pre_processed = (movie_fields
    .map(lambda fields: fields[2])
    .map(lambda x: convert_year(x))
    .collect())
years_pre_processed_array = np.array(years_pre_processed)
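The helper convert_year comes from the earlier code and is not shown in this section. The following is only a minimal sketch of what such a helper might look like, assuming that the raw release dates are strings ending in a four-digit year (for example, "01-Jan-1995") and that values that cannot be parsed are mapped to the sentinel 1900 mentioned in the text. The sample dates and the SENTINEL_YEAR name are invented for illustration:

```python
SENTINEL_YEAR = 1900  # assumed placeholder for bad records, as described in the text


def convert_year(date_string):
    """Return the trailing four-digit year, or the sentinel if parsing fails."""
    try:
        return int(date_string[-4:])
    except (ValueError, TypeError):
        return SENTINEL_YEAR


# Hypothetical raw release-date fields; the empty string plays the bad record.
raw_dates = ["01-Jan-1995", "22-Mar-1996", "", "14-Feb-1977"]
years = [convert_year(d) for d in raw_dates]
print(years)
```

Using a sentinel keeps the record in the dataset, so it can be located and filled in later rather than being dropped.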
First, we compute the mean and median year of release over all the release
years except the bad data point. We then use the numpy function where to find
the index of the bad value in years_pre_processed_array (recall that we
assigned the value 1900 to this data point). Finally, we use this index to
assign the median release year to the bad value:
# Select every release year except the bad data point (marked as 1900)
valid_years = years_pre_processed_array[years_pre_processed_array != 1900]
mean_year = np.mean(valid_years)
median_year = np.median(valid_years)
# Locate the first occurrence of the bad value and overwrite it with the median
index_bad_data = np.where(years_pre_processed_array == 1900)[0][0]
years_pre_processed_array[index_bad_data] = median_year
print("Mean year of release: %d" % mean_year)
print("Median year of release: %d" % median_year)
print("Index of '1900' after assigning median: %s" %
      np.where(years_pre_processed_array == 1900)[0])
You should expect to see the following output:
Mean year of release: 1989
Median year of release: 1995
Index of '1900' after assigning median: []
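The same fill-in step can be tried without Spark as a small standalone sketch. The function name fill_sentinel_with_median and the sample years are hypothetical and not part of the original code. The sketch uses a boolean mask rather than a single index, so it would also handle more than one bad record:

```python
import numpy as np


def fill_sentinel_with_median(values, sentinel=1900):
    """Return a copy of values with every sentinel entry replaced by the
    median of the remaining entries (hypothetical helper for illustration)."""
    values = np.asarray(values)
    bad = values == sentinel
    if bad.all():
        raise ValueError("no valid values to compute a median from")
    filled = values.copy()
    # The median is a float; assigning it into an integer array casts it to int.
    filled[bad] = np.median(values[~bad])
    return filled


# Invented sample: two bad records marked with the sentinel 1900.
years = np.array([1994, 1900, 1996, 1995, 1900, 1970, 1997])
filled = fill_sentinel_with_median(years)
print(filled)
print(np.where(filled == 1900)[0])
```

Working on a copy leaves the original array untouched, which makes it easy to compare the data before and after the fill.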
We computed both the mean and the median year of release here. As the output
shows, the median release year is noticeably higher than the mean because the
distribution of years is skewed. While it is not always straightforward to
decide which fill-in value to use in a given situation, in this case the
median is a reasonable choice because it is less affected by this skew.
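To see how skew separates the two statistics, consider a small invented sample in which most films are recent and a few are much older. The years below are made up for illustration and are not taken from the dataset:

```python
import numpy as np

# Invented release years: a cluster in the 1990s plus a tail of older films.
years = np.array([1930, 1945, 1960, 1993, 1994, 1995, 1995, 1996, 1996, 1997])

mean_year = np.mean(years)
median_year = np.median(years)

# The older films pull the mean down; the median stays inside the main cluster.
print("Mean: %.1f" % mean_year)
print("Median: %.1f" % median_year)
```

In a sample like this, the median sits closer to where most of the data lies, which is why it can be the safer fill-in value for a skewed field.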