# create the four plots of Figure 3.7
ggplot(mydata, aes(x,y)) +
  geom_point(size=4) +
  geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
  facet_wrap(~mygroup)
3.2.2 Dirty Data
This section addresses how dirty data can be detected in the data exploration phase
with visualizations. In general, analysts should look for anomalies, verify the data
with domain knowledge, and decide the most appropriate approach to clean the
data.
Consider a scenario in which a bank is conducting data analyses of its account
holders to gauge customer retention. Figure 3.8 shows the age distribution of the
account holders.
Figure 3.8 Age distribution of bank account holders
If the age data is stored in a vector called age, the graph can be created with the
following R script:
hist(age, breaks=100, main="Age Distribution of Account Holders",
     xlab="Age", ylab="Frequency", col="gray")
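The bank's data is not reproduced here, so the call can be tried on a simulated vector instead. The following is a minimal sketch under stated assumptions: the sample size, the normal distribution with mean 40 and standard deviation 12, and the seed are all made up for illustration and are not the data behind Figure 3.8. The names h, median_age, and n_under_10 are likewise hypothetical.

```r
# Hypothetical stand-in for the bank's data: every value below is simulated.
set.seed(42)                                   # make the sketch reproducible
age <- round(rnorm(5000, mean = 40, sd = 12))  # assumed shape of the age data
age <- age[age >= 0]                           # drop impossible negative draws

# Same histogram call as in the text; hist() also returns the bin counts
h <- hist(age, breaks = 100, main = "Age Distribution of Account Holders",
          xlab = "Age", ylab = "Frequency", col = "gray")

# Numeric checks to read alongside the plot
median_age <- median(age)      # center of the distribution
n_under_10 <- sum(age < 10)    # unusually young account holders
print(c(median = median_age, under_10 = n_under_10))
```

Because the vector is simulated, the resulting plot will only roughly resemble Figure 3.8. Pairing the histogram with a few numeric summaries makes it easier to confirm what the plot suggests, such as where the center lies and how many values fall in an unusual range.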
The figure shows that the median age of the account holders is around 40. A few
accounts whose holders are younger than 10 are unusual but plausible. These