[Figure: two histograms of income, each with horizontal axis "Income Bin". Panel (a) bins: 0-2M, 2-4M, 4-6M, 6-8M, 8-10M. Panel (b) bins: 0-100K, 100-200K, 200-300K, 300-400K, 400K-10M.]
Figure 3-4
Binning the attribute income with outliers treated.
An alternative is to transform, or assign a treatment to, values that
are too far from the average, or mean, value. The standard deviation is
the statistic typically used to measure that distance. Data values that
are, say, more than 3 standard deviations from the mean can be replaced
by NULLs or by edge values (i.e., the value at 3 standard deviations
from the mean). This allows binning to produce more informative bins.
As illustrated in Figure 3-4(b), if we replace the outliers with edge
values, the distribution of data across the bins can be more telling.
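The two treatments described above can be sketched in Java as follows. This is a minimal illustration, not the method of any particular data mining API: the class name OutlierTreatment, its method names, and the choice of the population standard deviation (dividing by n) are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of any data mining API. */
final class OutlierTreatment {

    /** Arithmetic mean of the values (NaN for an empty array). */
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Population standard deviation (divides by n; an assumption of this sketch). */
    static double stdDev(double[] values) {
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / values.length);
    }

    /** Replaces values more than k standard deviations from the mean with the edge value. */
    static double[] clipToEdges(double[] values, double k) {
        double m = mean(values);
        double s = stdDev(values);
        double low = m - k * s;
        double high = m + k * s;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Math.max(low, Math.min(high, values[i]));
        }
        return out;
    }

    /** Replaces values more than k standard deviations from the mean with null. */
    static Double[] nullOutliers(double[] values, double k) {
        double m = mean(values);
        double s = stdDev(values);
        Double[] out = new Double[values.length];
        for (int i = 0; i < values.length; i++) {
            if (Math.abs(values[i] - m) > k * s) {
                out[i] = null;                      // treated as missing
            } else {
                out[i] = Double.valueOf(values[i]); // kept unchanged
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] income = new double[20];
        Arrays.fill(income, 50_000.0);
        income[19] = 10_000_000.0; // one extreme income among typical ones
        System.out.println(Arrays.toString(clipToEdges(income, 3.0)));
    }
}
```

Binning would then be applied to the treated values, so that a single extreme income no longer stretches the bin boundaries.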
Derived Attributes
Sometimes, the data analyst or domain expert may be aware of
special relationships among predictor attributes that can be explic-
itly represented in the data. Whereas some algorithms may be able
to determine such relationships implicitly during model building,
providing them explicitly can improve model quality. Consider
three attributes: length, width, and height. If we are trying to mine
data involving boxes, it may be appropriate to include the volume
(length × width × height) and surface area
(2 × [(length × width) + (width × height) + (height × length)])
as explicit attributes. We may
decide to leave the original attributes in the dataset to determine if
they provide any value on their own.
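As a minimal sketch of the box example, the derived attributes can be computed and appended to each record while keeping the original attributes. The class BoxAttributes, its method names, and the row layout [length, width, height] are hypothetical and chosen only for illustration.

```java
/** Hypothetical helper for illustration; the row layout is an assumption. */
final class BoxAttributes {

    /** Volume of a box: length * width * height. */
    static double volume(double length, double width, double height) {
        return length * width * height;
    }

    /** Surface area of a box: twice the sum of the three distinct face areas. */
    static double surfaceArea(double length, double width, double height) {
        return 2.0 * (length * width + width * height + height * length);
    }

    /**
     * Takes a row [length, width, height] and returns
     * [length, width, height, volume, surfaceArea],
     * so the original attributes remain available to the algorithm.
     */
    static double[] withDerived(double[] row) {
        double l = row[0];
        double w = row[1];
        double h = row[2];
        return new double[] { l, w, h, volume(l, w, h), surfaceArea(l, w, h) };
    }

    public static void main(String[] args) {
        double[] extended = withDerived(new double[] { 2.0, 3.0, 4.0 });
        System.out.println(java.util.Arrays.toString(extended));
    }
}
```

Keeping the original columns alongside the derived ones makes it possible to check later whether length, width, and height contribute anything beyond volume and surface area.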
Further, we may apply a specific mathematical function such as
log to an attribute that has a very large range of possible values,
perhaps ones that grow exponentially. Other attributes may be derived