7.1.4 Specify Settings: Fine-Tune the Solution to the Problem
After exploring attribute values in the CUSTOMERS dataset, the data
miner found some oddities in the data. The capital gains attribute has
some extreme values that fall far outside the range seen in the general
population. Figure 7-1 illustrates the distribution of capital gains
values in the data. Note that very few customers have capital gains
greater than $1,000,000; in this example such values are treated as
outliers. Outliers are values of a given attribute that are unusual
compared to the rest of that attribute's data values. For example, if
some customers have capital gains over $1,000,000, these values could
skew mining results involving the capital gains attribute and should be
treated as discussed in Section 3.2.
In this example, the capital gains attribute has a valid range of $2,000
to $1,000,000, based on the value distribution shown in Figure 7-1. In
JDM, we use outlier identification settings to specify the valid range,
or interval; values outside it are identified as outliers for the model
building process. Some data mining engines (DMEs) automatically identify
and treat outliers as part of the model building process. JDM also allows
data miners to specify an outlier treatment option per attribute to
inform algorithms how to treat outliers in the build data. The outlier
treatment specifies whether attribute outlier values are treated
asMissing (handled as missing values) or asIs (handled as the original
values). Based on the problem requirements and vendor-specific algorithm
implementations, data miners can either explicitly choose the outlier
treatment or leave it to the DME.
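The effect of the two treatment options can be illustrated with a small, self-contained sketch. It does not use the JDM API: the names OutlierSketch, ValidRange, Treatment, and applyTreatment are hypothetical and exist only for this illustration, and the sample values are made up. It also assumes that both ends of the $2,000 to $1,000,000 range are inclusive, which the text does not state explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlierSketch {

    /** Hypothetical stand-in for the two treatment options named in the text. */
    enum Treatment { AS_MISSING, AS_IS }

    /** Hypothetical closed interval [low, high] describing the valid range (inclusive bounds are an assumption). */
    static final class ValidRange {
        final double low;
        final double high;

        ValidRange(double low, double high) {
            if (low > high) {
                throw new IllegalArgumentException("low must not exceed high");
            }
            this.low = low;
            this.high = high;
        }

        boolean contains(double value) {
            return value >= low && value <= high;
        }
    }

    /**
     * Returns a copy of the values with outliers handled per the treatment:
     * AS_MISSING replaces each outlier with null (a missing value),
     * AS_IS keeps every value unchanged.
     */
    static List<Double> applyTreatment(List<Double> values, ValidRange range, Treatment treatment) {
        List<Double> result = new ArrayList<>();
        for (Double value : values) {
            // A value that is already missing (null) is not an outlier.
            boolean isOutlier = value != null && !range.contains(value);
            if (isOutlier && treatment == Treatment.AS_MISSING) {
                result.add(null);
            } else {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Valid range for capital gains taken from the text: $2,000 to $1,000,000.
        ValidRange capitalGainsRange = new ValidRange(2_000, 1_000_000);
        // Made-up sample values for illustration only.
        List<Double> capitalGains =
                Arrays.asList(1_500.0, 2_000.0, 50_000.0, 1_000_000.0, 2_500_000.0);

        System.out.println(applyTreatment(capitalGains, capitalGainsRange, Treatment.AS_MISSING));
        System.out.println(applyTreatment(capitalGains, capitalGainsRange, Treatment.AS_IS));
    }
}
```

In a real JDM application the interval and the treatment would be specified through the vendor's build settings rather than applied by hand; the sketch only shows what each option means for the build data.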
In assessing the data, the data miner noticed that the state attribute
has some invalid entries. All ABCBank customers who are
[Figure 7-1: Capital gains value distribution. Axis: Capital Gains (2,000; 20,000; ...; > 1,000,000), with the outlier region labeled.]