Data Mining Process - Java Data Mining: Strategy, Standard, and Practice

from simple calculations, for example, given a person's birth date,
we can compute their current age; or we can compute total minutes
of usage per year by summing the minutes of usage per month.
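The two calculations just mentioned can be sketched in a few lines of Java. This is only an illustration: the class name DerivedAttributes, the method names, and the sample values are assumptions made for this sketch and do not come from any data mining API.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
final class DerivedAttributes {

    // Age in whole years as of the given reference date.
    static int ageInYears(LocalDate birthDate, LocalDate asOf) {
        return Period.between(birthDate, asOf).getYears();
    }

    // Total minutes of usage for a year, summed from monthly values.
    static long yearlyMinutes(List<Integer> monthlyMinutes) {
        long total = 0;
        for (int minutes : monthlyMinutes) {
            total += minutes;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        LocalDate birthDate = LocalDate.of(1990, 6, 15);
        System.out.println(ageInYears(birthDate, LocalDate.now()));
        System.out.println(yearlyMinutes(
                List.of(120, 95, 130, 110, 100, 90, 85, 140, 125, 115, 105, 135)));
    }
}
```

Passing the reference date explicitly, rather than always using the current date, keeps the derived age reproducible when the same data is prepared again later.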

Derived attributes may also be used to construct a target attribute.
For example, we may compute a “churn” attribute that is set to 1 if
minutes of usage drop by 75 percent, indicating churn, and 0 otherwise.
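A minimal sketch of such a derived target follows. The text does not say which periods are compared or how a customer with no prior usage is treated, so this sketch assumes a comparison of a previous period against the current one, treats "drops by 75 percent" as "drops by at least 75 percent", and assigns 0 when there is no baseline. The class and method names are hypothetical.

```java
// Hypothetical helper class, named here for illustration only.
final class ChurnTarget {

    // Assumed threshold: a drop of at least 75 percent indicates churn.
    static final double DROP_THRESHOLD = 0.75;

    // Returns 1 if usage fell by at least the threshold, and 0 otherwise.
    static int churnFlag(double previousMinutes, double currentMinutes) {
        if (previousMinutes <= 0) {
            // Assumption: with no baseline usage there is no measurable drop.
            return 0;
        }
        double drop = (previousMinutes - currentMinutes) / previousMinutes;
        return drop >= DROP_THRESHOLD ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(churnFlag(400, 100)); // drop of 75 percent, prints 1
        System.out.println(churnFlag(400, 200)); // drop of 50 percent, prints 0
    }
}
```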

Generating derived attributes often depends on the data miner's
domain understanding, creativity, or experience.

Attribute Reduction

Another common data preparation step involves reducing the number
of attributes used in mining. Some data mining tools can scale to
include large volumes of data for mining: millions of cases and
thousands of attributes. While it is possible to mine such data
volumes, it may not always be necessary or beneficial. If nothing
else, more data, whether in cases or attributes, requires more time to
build a model, and often more time to apply that model to new data.
Moreover, some attributes are more predictive than others. Including
nonpredictive or noisy attributes can actually have a negative impact
on model quality. Consider a dataset with 1,000 attributes, of which
only 100 are truly useful or necessary to build a model. In a best
case scenario, and assuming build time grows in proportion to the
number of attributes, building the model on all 1,000 attributes
wastes 90 percent of the execution time, since only 100 attributes
contribute positively to model quality. Identifying those 100
attributes is key.
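To make the idea of keeping only the useful attributes concrete, here is a hedged sketch. It assumes that some earlier ranking step has already assigned each attribute a numeric score, where a higher score means more useful and a score of zero or below means the attribute contributes nothing or hurts. How those scores are produced is outside this sketch, and the class name, method name, and sample scores are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named here for illustration only.
final class AttributeSelection {

    // Drops attributes whose score is not positive, ranks the rest from
    // highest to lowest score, and returns the names in the top fraction.
    static List<String> selectTop(Map<String, Double> scores, double topFraction) {
        if (topFraction <= 0.0 || topFraction > 1.0) {
            throw new IllegalArgumentException("topFraction must be in (0, 1]");
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > 0.0) {
                ranked.add(entry);
            }
        }
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        // Round up so that at least one attribute is kept when any survive.
        int keep = (int) Math.ceil(ranked.size() * topFraction);
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            selected.add(ranked.get(i).getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up scores for the example.
        Map<String, Double> scores = Map.of(
                "age_years", 0.42,
                "yearly_minutes", 0.31,
                "region", 0.12,
                "plan_type", 0.08,
                "customer_id", 0.0,
                "noise_field", -0.05);
        System.out.println(selectTop(scores, 0.5));
    }
}
```

With these sample scores, four attributes have a positive score, so a top fraction of 0.5 keeps the two highest ranked ones. Calling the method with several different fractions is one simple way to produce candidate subsets for comparison.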

Previously, we discussed manually removing attributes that are
constants or identifiers, or that contain too many missing values.
However, the data mining function attribute importance can be used to
determine which attributes contribute most to model quality. In
supervised techniques such as classification and regression,
attribute importance identifies the attributes that contribute most
to predicting the target. In unsupervised techniques such as
clustering, it can identify which attributes are most useful for
distinguishing cases among clusters. Since attribute importance ranks
the attributes from most important to least, a decision needs to be
made as to what percentage of the top attributes to include. In some
cases, attributes may be identified as negatively impacting model
quality; these can easily be removed. Others may contribute nothing;
these too can easily be removed. If a large number of attributes
remains, it may still be appropriate to build models on different
subsets of the top attributes to decide which subset produces the
best or most
