suitable model. Data miners may trade off greater speed of scoring
for a small reduction in model accuracy.
Some algorithms, such as decision trees, identify important
attributes by the very nature of their processing. Providing a large
number of attributes to such algorithms can work as well as, or better
than, attribute importance for supervised learning. In subsequent
model builds, perhaps when refreshing the model or building a pro-
duction model, it may be reasonable to include only those attributes
actually used in the initial model.
Transforming Data
Much of what we have discussed in this section involves transforming
data. There are some fairly standard data mining transformations we
should introduce in more detail: binning, normalization, explosion,
sampling, and recoding. Binning involves reducing the cardinality—the
number of distinct values—of an attribute. Some algorithms, like naïve
Bayes, work best when provided a relatively small number of distinct
values per attribute. Consider the attribute age, with a continuous value
range from 0 to 100. A person's age can be any number in this range. If
real numbers are allowed, there are an infinite number of possible val-
ues, which make understanding the distribution of ages difficult. Bin-
ning of numbers involves defining intervals or ranges of numbers that
correspond to a bin, or single value. In the example involving age, we
could define four bins: 0–25, 25–50, 50–75, and 75–100, where the lower
endpoint of each interval is excluded (e.g., 25–50 means 25 < x ≤ 50). Producing a
bar chart of binned age frequencies gives us a more meaningful view of
the distribution. Alternatively, binning can also be applied to discrete
values to reduce cardinality. For example, consider the 50 states of the
United States; we may want to bin them into five bins: northeast, south-
east, central, southwest, and northwest. Here, we can explicitly map CT, MA,
ME, NH, NJ, NY, PA, RI, and VT to Northeast.
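The two kinds of binning described above could be sketched in Java as follows. This is a minimal illustration; the helper `binAge` and the map `STATE_BINS` are names chosen for this example, not part of any data mining API, and only the northeast bin is populated here.

```java
import java.util.Map;

public class BinningExample {

    // Map a continuous age in 0-100 to one of four bins; the lower
    // endpoint of each interval is excluded, so 25 falls in "0-25"
    // and 26 falls in "25-50".
    static String binAge(double age) {
        if (age <= 25) return "0-25";
        if (age <= 50) return "25-50";
        if (age <= 75) return "50-75";
        return "75-100";
    }

    // Binning of discrete values by explicit mapping: the nine
    // northeastern states from the text, each mapped to one bin.
    static final Map<String, String> STATE_BINS = Map.of(
        "CT", "Northeast", "MA", "Northeast", "ME", "Northeast",
        "NH", "Northeast", "NJ", "Northeast", "NY", "Northeast",
        "PA", "Northeast", "RI", "Northeast", "VT", "Northeast");

    public static void main(String[] args) {
        System.out.println(binAge(30));           // 25-50
        System.out.println(STATE_BINS.get("NY")); // Northeast
    }
}
```

Either way, the transformed attribute has a small, fixed set of values, which is what algorithms sensitive to cardinality need.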
The normalization transformation applies only to numerical data
where we need to compress or normalize the scale of an attribute's
values. Normalization allows us to avoid having one attribute
overly impact an algorithm's processing simply because it contains
large numbers. For example, a neural network requires its input
data to be normalized to ensure that an attribute income, with a range
from 0 to 1,000,000, doesn't overshadow the attribute age, with a range
of 0 to 100. Normalization proportionally “squeezes” the values into
a uniform range, typically 0 to 1 or −1 to 1.
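One common way to squeeze values into the 0 to 1 range is min-max normalization, sketched below. The method name `normalize` is an illustrative choice, not a library API, and the income and age figures simply reuse the ranges from the text.

```java
public class NormalizationExample {

    // Min-max normalization: proportionally rescale a value from
    // [min, max] into [0, 1].
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        // An income of 250,000 on a 0-1,000,000 scale and an age of
        // 50 on a 0-100 scale both land in the same 0-1 range, so
        // neither attribute dominates simply by having larger numbers.
        System.out.println(normalize(250_000, 0, 1_000_000)); // 0.25
        System.out.println(normalize(50, 0, 100));            // 0.5
    }
}
```

To target −1 to 1 instead, the same ratio can be scaled and shifted (multiply by 2 and subtract 1).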
The explosion transformation applies only to discrete data such as
strings or numbers representing individual categories. The goal is to