Databases Reference
In-Depth Information
Figure 3.3 A 2-D customer data plot with respect to customer locations in a city, showing three data
clusters. Outliers may be detected as values that fall outside of the cluster sets.
process. Data discretization is discussed in Section 3.5. Some methods of classification
(e.g., neural networks) have built-in data smoothing mechanisms. Classification is the
topic of Chapters 8 and 9.
3.2.3 Data Cleaning as a Process
Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have
looked at techniques for handling missing data and for smoothing data. “But data clean-
ing is a big job. What about data cleaning as a process? How exactly does one proceed in
tackling this task? Are there any tools out there to help?”
The first step in data cleaning as a process is discrepancy detection . Discrepancies can
be caused by several factors, including poorly designed data entry forms that have many
optional fields, human error in data entry, deliberate errors (e.g., respondents not want-
ing to divulge information about themselves), and data decay (e.g., outdated addresses).
Discrepancies may also arise from inconsistent data representations and inconsistent use
of codes. Other sources of discrepancies include errors in instrumentation devices that
record data and system errors. Errors can also occur when the data are (inadequately)
used for purposes other than originally intended. There may also be inconsistencies due
to data integration (e.g., where a given attribute can have different names in different
databases). 2
2 Data integration and the removal of redundant data that can result from such integration are further
described in Section 3.3.
 
Search WWH ::




Custom Search