tolerate some degree of error in the data, referred to as noise;
however, too much noise and not enough valid signal can result in
poor quality models.
Various errors can be detected and corrected through data cleaning techniques [Rahm/Do 2000] [Pyle 1999]. For example, reviewing the unique values in a column may expose spelling mistakes or invalid values. Comparing names and addresses across tables for small differences, where most of the information is the same, can highlight matching cases. Once many of the errors have been addressed, duplicate cases can more easily be identified and ultimately removed.
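As a rough illustration of this kind of cleaning, the sketch below flags name values that differ by only one or two character edits, the sort of small spelling differences mentioned above. The class name, the sample data, and the edit-distance threshold are illustrative assumptions, not part of any particular data mining API.

import java.util.List;

/** Illustrative sketch: flag near-duplicate name values using edit distance. */
public class NearDuplicateFinder {

    /** Classic Levenshtein (edit) distance between two strings. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Hypothetical column of customer names; in practice these would come
        // from a database query or a file.
        List<String> names = List.of("John Smith", "Jon Smith", "Jane Doe", "John Smyth");

        // Report pairs that differ by at most two edits: likely the same person
        // entered with a small spelling mistake.
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                int dist = editDistance(names.get(i), names.get(j));
                if (dist > 0 && dist <= 2) {
                    System.out.println("Possible duplicate: \"" + names.get(i)
                            + "\" vs \"" + names.get(j) + "\" (edit distance " + dist + ")");
                }
            }
        }
    }
}

Pairs flagged this way would typically be reviewed by hand before the corresponding cases are merged or removed.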
Another common plague on data involves outliers. The term outlier can be applied to values within an attribute or to entire cases in the data. The effects of outliers differ depending on the data mining technique or data preparation technique. For example, consider an attribute income with a distribution centered around $100,000, but also with some very high income values in the millions of dollars. If we need to bin this data into discrete bins, we may choose to take the maximum and minimum values and divide the range into equal subranges. If the minimum income is zero and the maximum is $10,000,000, we can divide this into five bins: 0-2M, 2M-4M, and so on. However, with the bulk of the entries centered around $100,000, we could find that the first bin (0-2M) contains 99 percent of the data. Such an outcome is not very useful when mining data, since this would result in an attribute whose values are 99 percent the same, in effect a constant! The original distribution is illustrated in Figure 3-3, and the binned distribution is illustrated in Figure 3-4(a).
Figure 3-3 Binning the attribute income with outliers. (x-axis: Income ($), 0 to 10M)
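To make the binning problem concrete, the following sketch applies equal-width binning to a synthetic income attribute whose bulk lies near $100,000, with a handful of multi-million-dollar outliers. The class name and the generated data are assumptions made for illustration only.

import java.util.Random;

/** Illustrative sketch: equal-width binning of a skewed income attribute. */
public class EqualWidthBinning {

    public static void main(String[] args) {
        // Synthetic income data: most values near $100,000, a few in the millions.
        Random rnd = new Random(42);
        double[] income = new double[10_000];
        for (int i = 0; i < income.length; i++) {
            income[i] = 100_000 + rnd.nextGaussian() * 20_000;     // bulk of the data
        }
        for (int i = 0; i < 10; i++) {
            income[i] = 1_000_000 + rnd.nextDouble() * 9_000_000;  // a handful of outliers
        }

        // Equal-width binning: split [min, max] into five subranges of equal size.
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : income) { min = Math.min(min, v); max = Math.max(max, v); }
        int numBins = 5;
        double width = (max - min) / numBins;

        int[] counts = new int[numBins];
        for (double v : income) {
            int bin = (int) ((v - min) / width);
            if (bin == numBins) bin--;             // place the maximum in the last bin
            counts[bin]++;
        }

        // With outliers stretching the range to roughly $10M, nearly every case
        // lands in the first bin, so the binned attribute is almost a constant.
        for (int b = 0; b < numBins; b++) {
            System.out.printf("Bin %d [%.0f, %.0f): %d cases%n",
                    b + 1, min + b * width, min + (b + 1) * width, counts[b]);
        }
    }
}

Running the sketch shows nearly all cases falling into the first bin, which is why outliers are often removed or capped before equal-width binning is applied.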