Java Reference

In-Depth Information

0-2M

2-4M

4-6M

6-8M

8-10M

0-100K

100-200K 200-300K 300-400K 400K-10M

Income Bin

(a)

Income Bin

(b)

Figure 3-4

Binning the attribute income with outliers treated.

An alternative is to transform, or assign a
treatment
to, values that

are too far away from the average, or
mean,
value. Standard deviation is

a typical statistic. Data values that are, say, more than 3 standard devia-

tions from the mean can be replaced by NULLs, or edge values (i.e., the

value at 3 standard deviations from the mean). This allows binning

to produce more informative bins. As illustrated in Figure 3-4(b), if we

replace the outliers with edge values, we see the distribution of data in

the bins can be more telling.

Derived Attributes

Sometimes, the data analyst or domain expert may be aware of

special relationships among predictor attributes that can be explic-

itly represented in the data. Whereas some algorithms may be able

to determine such relationships implicitly during model building,

providing them explicitly can improve model quality. Consider

three attributes:
length,
width
and
height
. If we are trying to mine

data involving boxes, it may be appropriate to include the
volume

(length

width

height) and
surface area
(2

[(length

width)

(width

length)]) as explicit attributes. We may

decide to leave the original attributes in the dataset to determine if

they provide any value on their own.

Further, we may apply a specific mathematical function such as

log to an attribute that has a very large range of possible values,

perhaps that grow exponentially. Other attributes may be derived

height)

(height

Search WWH ::

Custom Search