hist(mortgage, breaks=10, xlab="Mortgage Age", col="gray",
     main="Portfolio Distribution, Years Since Origination")
Figure 3.9 shows that the loans are no more than 10 years old, and these
10-year-old loans have a disproportionate frequency compared to the rest of the
population. One possible explanation is that the 10-year-old group includes not
only loans originated 10 years ago but also those originated earlier than that.
In other words, the 10 on the x-axis actually means ≥ 10. This sometimes happens
when data is ported from one system to another or because the data provider
decided, for some reason, not to distinguish loans that are more than 10 years old.
Analysts need to study the data further and decide the most appropriate way to
perform data cleansing.
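One way to start that investigation is to tabulate the recorded ages and measure how much of the portfolio sits exactly at the maximum value. The following is a minimal sketch, not the book's own code: the mortgage vector here is simulated, and the assumption that ages above 10 were recorded as 10 is built in purely for illustration.

```r
# Hypothetical stand-in for the mortgage vector: ages in whole years,
# with every age above 10 recorded as 10 (an assumption for illustration).
set.seed(1)
true_age <- sample(0:15, 500, replace = TRUE)
mortgage <- pmin(true_age, 10)

# Frequency of each recorded age; a spike at the maximum hints at capping.
age_counts <- table(mortgage)
print(age_counts)

# Share of loans sitting exactly at the maximum recorded age.
share_at_max <- mean(mortgage == max(mortgage))
print(share_at_max)
```

A share at the maximum that is far larger than the share in neighboring years is a hint, not proof, that the last bin is a catch-all.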
Data analysts should perform sanity checks against domain knowledge and decide
whether the dirty data needs to be eliminated. Consider the task of estimating the probability
of mortgage loan default. If past observations suggest that most defaults occur
before about the 4th year and 10-year-old mortgages rarely default, it may be safe
to eliminate the dirty data and assume that the defaulted loans are less than 10
years old. For other analyses, it may become necessary to track down the source
and find out the true origination dates.
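The sanity check and elimination described above might look like the sketch below. The data frame, its column names (age, defaulted), and the cap value are all hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical loan records: recorded age in years and a default flag.
loans <- data.frame(
  age       = c(1, 2, 3, 3, 4, 7, 10, 10, 10),
  defaulted = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)
cap <- 10  # assumed value at which older loans were lumped together

# Sanity check: how many defaults fall in the suspect group?
defaults_at_cap <- sum(loans$defaulted & loans$age >= cap)

# If that count is negligible, drop the suspect records for this analysis.
clean_loans <- subset(loans, age < cap)
```

Whether dropping these rows is acceptable depends on the analysis; if defaults at the cap were common, the true origination dates would have to be recovered instead.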
Dirty data can occur due to acts of omission. In the sales data used at the
beginning of this chapter, it was seen that the minimum number of orders was
1 and the minimum annual sales amount was $30.02. Thus, there is a strong
possibility that the provided dataset did not include the sales data on all customers,
just the customers who purchased something during the past year.
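A quick check for this kind of omission is to look at the minimums of the relevant columns. The sketch below uses a tiny made-up data frame; the column names num_of_orders and sales_total are assumptions, and only the minimum values of 1 and 30.02 come from the text.

```r
# Hypothetical miniature of the sales data; column names are assumptions.
sales <- data.frame(
  num_of_orders = c(1, 3, 2, 5),
  sales_total   = c(30.02, 151.40, 88.10, 402.75)
)

# Minimums strictly above zero suggest that customers with no
# purchases are absent from the dataset.
min_orders <- min(sales$num_of_orders)
min_sales  <- min(sales$sales_total)
no_inactive_customers <- !any(sales$num_of_orders == 0)
```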
3.2.3 Visualizing a Single Variable
Using visual representations of data is a hallmark of exploratory data analyses:
letting the data speak to its audience rather than imposing an interpretation on
the data a priori. Sections 3.2.3 and 3.2.4 examine ways of displaying data to help
explain the underlying distributions of a single variable or the relationships of two
or more variables.
R has many functions available to examine a single variable. Some of these
functions are listed in Table 3.4.
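As a rough illustration, the sketch below applies a few base R functions that are commonly used to inspect a single numeric variable. The variable x is simulated, and this selection of functions is an assumption; it is not necessarily the same set that the table lists.

```r
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)  # hypothetical numeric variable

# Histogram: counts of observations per bin.
h <- hist(x, breaks = 10, col = "gray", main = "Histogram of x")

# Kernel density estimate, with a rug marking the individual observations.
d <- density(x)
plot(d, main = "Density estimate of x")
rug(x)

# Stem-and-leaf display printed to the console.
stem(x)
```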