Data Preprocessing - Data Mining: Concepts and Techniques - page 91

Figure 3.3 A 2-D customer data plot with respect to customer locations in a city, showing three data clusters. Outliers may be detected as values that fall outside of the cluster sets.
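
The idea in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not name a clustering algorithm, so plain k-means with caller-supplied starting centroids is used here, and the fixed `radius` cutoff, the function names, and the sample coordinates are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm), an assumed choice of clustering method."""
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def find_outliers(points, centroids, radius):
    """Flag points farther than `radius` (an assumed threshold) from every centroid."""
    return [p for p in points
            if min(math.dist(p, c) for c in centroids) > radius]

# Hypothetical customer locations: three tight groups plus one stray point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (1, 8), (1, 9), (2, 8), (2, 9),
          (5, 3)]
centroids = kmeans(points, [(1, 1), (8, 8), (1, 8)])
outliers = find_outliers(points, centroids, radius=2.0)
print(outliers)
```

In this sample, the stray point is the only one that lies more than two units from every cluster center, so it is the only value flagged. In practice the threshold would need to be chosen from the data rather than fixed in advance.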

process. Data discretization is discussed in Section 3.5. Some methods of classification (e.g., neural networks) have built-in data smoothing mechanisms. Classification is the topic of Chapters 8 and 9.

3.2.3 Data Cleaning as a Process

Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have looked at techniques for handling missing data and for smoothing data. “But data cleaning is a big job. What about data cleaning as a process? How exactly does one proceed in tackling this task? Are there any tools out there to help?”

The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses). Discrepancies may also arise from inconsistent data representations and inconsistent use of codes. Other sources of discrepancies include errors in instrumentation devices that record data and system errors. Errors can also occur when the data are (inadequately) used for purposes other than originally intended. There may also be inconsistencies due to data integration (e.g., where a given attribute can have different names in different databases).²

2 Data integration and the removal of redundant data that can result from such integration are further described in Section 3.3.
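
Several of the discrepancy sources listed above (empty optional fields, inconsistent use of codes, inconsistent data representations) can be checked mechanically. The following is a minimal sketch, not a method prescribed by the text: the field names `gender`, `signup_date`, and `address`, the allowed code set, and the assumed canonical date format `YYYY-MM-DD` are all hypothetical.

```python
import re

# Assumed canonical date representation (YYYY-MM-DD); purely illustrative.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def detect_discrepancies(records, allowed_codes):
    """Return (record index, field, problem) tuples for simple discrepancies."""
    issues = []
    for i, rec in enumerate(records):
        # Fields left empty, e.g. optional fields on a data entry form.
        for field, value in rec.items():
            if value is None or str(value).strip() == "":
                issues.append((i, field, "missing value"))
        # Inconsistent use of codes.
        code = rec.get("gender")
        if code and code not in allowed_codes:
            issues.append((i, "gender", "unknown code"))
        # Inconsistent data representation.
        date = rec.get("signup_date")
        if date and not DATE_PATTERN.match(date):
            issues.append((i, "signup_date", "inconsistent format"))
    return issues

# Hypothetical customer records.
records = [
    {"gender": "F", "signup_date": "2010-04-25", "address": "12 Main St"},
    {"gender": "female", "signup_date": "25/04/2010", "address": ""},
    {"gender": "M", "signup_date": "2011-01-09", "address": None},
]
issues = detect_discrepancies(records, allowed_codes={"F", "M"})
for issue in issues:
    print(issue)
```

A check like this only reports candidate discrepancies. Deciding whether each one is a genuine error, and how to correct it, still depends on knowledge of the data.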
