Java Reference
In-Depth Information
7.1.4
Specify Settings: Fine-Tune the Solution to the Problem
After exploring attribute values in the CUSTOMERS dataset, the data
miner found some oddities in the data. The capital gains attribute has
some extreme values that are out of range from the general popula-
tion. Figure 7-1 illustrates the distribution of capital gains values in
the data. Note that there are very few customers who have capital
gains greater than $1,000,000; in this example such values are treated
as outliers . Outliers are the values of a given attribute that are
unusual compared to the rest of that attribute's data values. For
example, if customers have capital gains over 1 million dollars, these
values could skew mining results involving the attribute capital gains
and should be treated as discussed in Section 3.2.
In this example, the capital gains attribute has a valid range of $2,000
to $1,000,000 based on the value distribution, shown in Figure 7-1. In
JDM, we use outlier identification settings to specify the valid range,
or interval , to identify outliers for the model building process. Some
data mining engines (DMEs) automatically identify and treat outliers
as part of the model building process. JDM allows data miners to
specify an outlier treatment option per attribute to inform algorithms
how to treat outliers in the build data. The outlier treatment specifies
whether attribute outlier values are treated asMissing (should be
handled as missing values) or asIs (should be handled as the original
values). Based on the problem requirements and vendor-specific
algorithm implementations, data miners can either explicitly
choose the outlier treatment or leave it to the DME.
In assessing the data, the data miner noticed that the state
attribute has some invalid entries. All ABCBank customers who are
2,000 20,000 . . . . . > 1,000,000
Outliers
Capital Gains
Figure 7-1
Capital gains value distribution.
 
Search WWH ::




Custom Search