DF1 <- na.exclude(DF)
DF1
x y
1 1 10
2 2 20
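The output above presupposes a DF that contains at least one incomplete row. A minimal, self-contained sketch follows. The three-row DF defined here is an assumption, chosen only so that the result matches the two rows printed above.

```r
# Hypothetical input: the third row has a missing y value.
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))

# na.exclude() drops every row that contains at least one NA.
DF1 <- na.exclude(DF)
print(DF1)

# The indices of the dropped rows are kept in the "na.action" attribute.
print(attr(DF1, "na.action"))
```

Keeping the dropped indices around can be useful later, for example when residuals or predictions from a model need to be padded back to the length of the original data.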
Ages above 100 among account holders may reflect bad data caused by typos. Another
possibility is that these accounts may have been passed down to the heirs of
the original account holders without being updated. In this case, one needs to
further examine the data and conduct data cleansing if necessary. The dirty data
could be simply removed or filtered out with an age threshold for future analyses.
If removing records is not an option, the analysts can look for patterns within
the data and develop a set of heuristics to attack the problem of dirty data. For
example, wrong age values could be replaced with an approximation based on the
nearest neighbor (the record that is the most similar to the record in question
based on analyzing the differences in all the other variables besides age).
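The two remedies described above, filtering by an age threshold and nearest-neighbor replacement, can be sketched in R as follows. The accounts data frame, its column names, the threshold of 100, and the use of Euclidean distance on scaled variables are all assumptions made for illustration. The text does not prescribe a particular distance measure.

```r
# Hypothetical account data; column names and values are illustrative only.
accounts <- data.frame(
  age     = c(34, 52, 41, 147, 29),
  balance = c(1200, 8300, 4100, 7900, 600),
  tenure  = c(3, 20, 9, 19, 1)   # years as a customer
)

max_age <- 100                    # assumed plausibility threshold
valid   <- accounts$age <= max_age

# Option 1: filter out the implausible records.
accounts_filtered <- accounts[valid, ]

# Option 2: replace each implausible age with the age of the nearest
# valid record, measured by Euclidean distance on the other variables.
features <- scale(accounts[, c("balance", "tenure")])  # common scale for all columns
accounts_imputed <- accounts
for (i in which(!valid)) {
  diffs <- sweep(features[valid, , drop = FALSE], 2, features[i, ])
  dists <- sqrt(rowSums(diffs^2))
  accounts_imputed$age[i] <- accounts$age[valid][which.min(dists)]
}
print(accounts_imputed)
```

Scaling the variables first keeps a large-valued column such as balance from dominating the distance. Whether that is appropriate depends on the data at hand.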
Figure 3.9 presents another example of dirty data. The distribution shown here
corresponds to the age of mortgages in a bank's home loan portfolio. The mortgage
age is calculated by subtracting the origination date of the loan from the current
date. The vertical axis corresponds to the number of mortgages at each mortgage
age.
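The calculation described above can be sketched in R as follows. The origination dates are made up, the reference date is fixed instead of taken from Sys.Date() so that the example is reproducible, and dividing by 365.25 days is an assumed approximation for converting days to years.

```r
# Hypothetical origination dates for four loans.
origination  <- as.Date(c("2004-06-15", "2010-03-01", "2010-09-20", "2013-01-10"))

# Fixed reference date; in practice this would typically be Sys.Date().
current_date <- as.Date("2014-01-01")

# Mortgage age in whole years: days elapsed divided by an average year length.
age_years <- floor(as.numeric(current_date - origination) / 365.25)

# Number of mortgages at each mortgage age (the vertical axis of the histogram).
counts <- table(age_years)
print(counts)
```

Negative or implausibly large values in age_years would point to the kind of dirty data discussed here, such as a missing or default origination date.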
Figure 3.9 Distribution of mortgage age in years since origination from a bank's
home loan portfolio
following R script.