from simple calculations. For example, given a person's birth date,
we can compute their current age, or we can compute total minutes
of usage per year by summing the minutes of usage per month.
Derived attributes may also be used to construct a target attribute.
For example, we may compute a “churn” attribute that is set to 1 if
minutes of usage drop by 75 percent, indicating churn, and to 0
otherwise.
Generating derived attributes often depends on the data miner's
domain understanding, creativity, or experience.
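
The derived attributes described above can be sketched in a few lines of
Python. This is a minimal illustration, not a prescribed method: the
function names are hypothetical, the churn rule assumes that a drop of
"75 percent" means a relative drop of at least 75 percent between two
comparison periods, and treating a customer with no baseline usage as
not churned is likewise an assumption.

```python
from datetime import date
from typing import Sequence


def age_in_years(birth_date: date, today: date) -> int:
    """Derive current age in whole years from a birth date."""
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not occurred yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def total_minutes_per_year(monthly_minutes: Sequence[float]) -> float:
    """Derive yearly usage by summing the monthly usage values."""
    return sum(monthly_minutes)


def churn_flag(previous_minutes: float, current_minutes: float,
               drop_threshold: float = 0.75) -> int:
    """Derive a 0/1 target: 1 if usage fell by at least drop_threshold.

    Assumption: the drop is measured relative to previous_minutes.
    """
    if previous_minutes <= 0:
        # Assumption: with no baseline usage, do not flag churn.
        return 0
    relative_drop = (previous_minutes - current_minutes) / previous_minutes
    return 1 if relative_drop >= drop_threshold else 0
```

For instance, churn_flag(400, 100) flags churn because usage fell from
400 to 100 minutes, a relative drop of exactly 75 percent, while
churn_flag(400, 200) does not.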
Another common data preparation step involves reducing the number
of attributes used in mining. Some data mining tools can scale to
include large volumes of data for mining: millions of cases and
thousands of attributes. While it is possible to mine such data volumes, it
may not always be necessary or beneficial. If nothing else, more data,
either in cases or attributes, requires more time to build a model, and
often requires more time to apply that model to new data. Moreover,
some attributes are more predictive than others. Models built using
nonpredictive or noisy attributes can actually have a negative impact
on model quality. Consider a dataset with 1,000 attributes, of which
only 100 are truly useful for building a model. Even in the best case,
where build time grows only in proportion to the number of attributes,
building the model on all 1,000 attributes wastes 90 percent of the
execution time, since only 100 attributes contribute positively to
model quality. Identifying those 100 attributes is key.
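
The 90 percent figure follows from simple arithmetic if we assume that
build time is proportional to the number of attributes. The sketch
below makes that assumption explicit; the function name is hypothetical,
and real algorithms may scale worse than linearly, in which case the
waste would be larger.

```python
def wasted_fraction(total_attributes: int, useful_attributes: int) -> float:
    """Fraction of build time spent on attributes that do not help.

    Assumption: build time is proportional to the attribute count.
    """
    if total_attributes <= 0:
        raise ValueError("total_attributes must be positive")
    if not 0 <= useful_attributes <= total_attributes:
        raise ValueError("useful_attributes must be between 0 and total")
    return (total_attributes - useful_attributes) / total_attributes


# 900 of the 1,000 attributes add cost without improving the model.
print(wasted_fraction(1000, 100))  # 0.9
```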
Previously, we discussed manually removing attributes that are
constants or identifiers, or that contain too many missing values.
Beyond such manual pruning, the data mining function attribute
importance can be used to determine which attributes contribute most
to model quality. In supervised techniques such as classification and
regression, attribute importance identifies the attributes that
contribute most to predicting the target. In unsupervised techniques
such as clustering, it
can identify which attributes are most useful for distinguishing
cases among clusters. Since attribute importance ranks the attributes
from most important to least, a decision needs to be made as to what
percentage of the top attributes to include. In some cases, attributes
may be identified as negatively impacting model quality, and others
may contribute nothing; both kinds can easily be removed. If a large
number of attributes remain, it may be appropriate to build models
on different subsets of the top attributes to decide which subset
produces the best or most