Java Data Mining Concepts - Java Data Mining: Strategy, Standard, and Practice

7.1.4 Specify Settings: Fine-Tune the Solution to the Problem

After exploring attribute values in the CUSTOMERS dataset, the data miner found some oddities in the data. The capital gains attribute has some extreme values that fall far outside the range of the general population. Figure 7-1 illustrates the distribution of capital gains values in the data. Note that very few customers have capital gains greater than $1,000,000; in this example such values are treated as outliers. Outliers are values of a given attribute that are unusual compared to the rest of that attribute's data values. For example, capital gains over $1,000,000 could skew mining results involving the capital gains attribute, so they should be treated as discussed in Section 3.2.
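As a rough illustration of this exploration step, the following sketch counts how many values of an attribute fall below, within, and above a candidate range. It is plain Java and does not use the JDM API; the class name `CapitalGainsExplorer`, the method `countByRange`, and the sample values are hypothetical, invented only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JDM: summarizes where values fall
// relative to a candidate valid range.
final class CapitalGainsExplorer {

    // Counts values below the lower bound, within [lower, upper],
    // and above the upper bound.
    static Map<String, Integer> countByRange(double[] values, double lower, double upper) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("below", 0);
        counts.put("within", 0);
        counts.put("above", 0);
        for (double v : values) {
            String bucket = v < lower ? "below" : (v > upper ? "above" : "within");
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        double[] capitalGains = {2_500, 18_000, 40_000, 950_000, 1_200_000, 3_400_000};
        System.out.println(countByRange(capitalGains, 2_000, 1_000_000));
    }
}
```

A small count in the "above" bucket relative to "within" is the kind of signal that suggests treating those values as outliers.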

In this example, the capital gains attribute has a valid range of $2,000 to $1,000,000, based on the value distribution shown in Figure 7-1. In JDM, outlier identification settings specify this valid range, or interval; values outside it are identified as outliers for the model building process. Some data mining engines (DMEs) automatically identify and treat outliers as part of the model building process. JDM allows data miners to specify an outlier treatment option per attribute to tell algorithms how to treat outliers in the build data. The outlier treatment specifies whether an attribute's outlier values are handled as asMissing (as missing values) or asIs (as the original values). Depending on the problem requirements and vendor-specific algorithm implementations, data miners can either choose the outlier treatment explicitly or leave it to the DME.
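The effect of a valid interval combined with a treatment option can be pictured with a small sketch. It is plain Java and does not call the JDM API: the class `OutlierSettings`, its nested enum `Treatment`, and the method `apply` are hypothetical stand-ins. Only the option names asMissing and asIs and the $2,000 to $1,000,000 range come from the text. Representing a missing value as `null` is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; this is NOT a JDM interface.
final class OutlierSettings {

    // Mirrors the two treatment options named in the text.
    enum Treatment { AS_MISSING, AS_IS }

    private final double lower;
    private final double upper;
    private final Treatment treatment;

    OutlierSettings(double lower, double upper, Treatment treatment) {
        if (lower > upper) {
            throw new IllegalArgumentException("lower bound exceeds upper bound");
        }
        this.lower = lower;
        this.upper = upper;
        this.treatment = treatment;
    }

    // A value is an outlier when it lies outside the valid interval [lower, upper].
    boolean isOutlier(double value) {
        return value < lower || value > upper;
    }

    // Returns a copy of the values with the treatment applied.
    // Assumption: a missing value is represented as null.
    List<Double> apply(List<Double> values) {
        List<Double> result = new ArrayList<>();
        for (Double v : values) {
            boolean replace = v != null && isOutlier(v) && treatment == Treatment.AS_MISSING;
            result.add(replace ? null : v);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<Double> capitalGains = Arrays.asList(1_500.0, 20_000.0, 1_000_000.0, 2_500_000.0);
        OutlierSettings settings = new OutlierSettings(2_000, 1_000_000, Treatment.AS_MISSING);
        System.out.println(settings.apply(capitalGains));
    }
}
```

With `Treatment.AS_IS` the same call returns the values unchanged, which corresponds to handing the original values to the algorithm.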

In assessing the data, the data miner noticed that the state attribute has some invalid entries. All ABCBank customers who are

Figure 7-1 Capital gains value distribution. (Axis: Capital Gains, from 2,000 and 20,000 up to > 1,000,000, with a region labeled Outliers.)
