values correspond to discrete, nominal categories. Ordinal attributes also have discrete values, but their order is significant. In Table 7-2, the attribute type column specifies attributes such as city, county, state, education, and marital status as categorical attributes. The attribute capital gains is a numerical attribute because it has continuous data values, such as $12,500.94. The attribute credit risk is an ordinal attribute because its values, high, medium, and low, are ordered relative to one another.
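The three attribute types can be sketched in Java as follows. This is a minimal illustration, not the API of any particular data mining library: the class AttributeTypeSketch, its enums, and its methods are hypothetical names introduced here. The attribute names come from the discussion of Table 7-2 above. Declaring the credit risk values from lowest to highest lets the enum's natural ordering stand in for the ordinal relationship.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; these names are not taken from a real mining API.
class AttributeTypeSketch {

    // The three attribute types discussed in the text.
    enum AttributeType { CATEGORICAL, NUMERICAL, ORDINAL }

    // Ordinal values declared from lowest to highest, so that
    // compareTo() reflects their relative order.
    enum CreditRisk { LOW, MEDIUM, HIGH }

    // Attribute types for the attributes named in the text.
    static Map<String, AttributeType> exampleTypes() {
        Map<String, AttributeType> types = new LinkedHashMap<>();
        types.put("city", AttributeType.CATEGORICAL);
        types.put("county", AttributeType.CATEGORICAL);
        types.put("state", AttributeType.CATEGORICAL);
        types.put("education", AttributeType.CATEGORICAL);
        types.put("marital status", AttributeType.CATEGORICAL);
        types.put("capital gains", AttributeType.NUMERICAL);
        types.put("credit risk", AttributeType.ORDINAL);
        return types;
    }

    // True when risk a is strictly higher than risk b.
    static boolean isHigherRisk(CreditRisk a, CreditRisk b) {
        return a.compareTo(b) > 0;
    }

    public static void main(String[] args) {
        System.out.println(exampleTypes());
        System.out.println(isHigherRisk(CreditRisk.HIGH, CreditRisk.MEDIUM));
    }
}
```

A categorical attribute such as city supports only equality tests, whereas the ordinal credit risk also supports comparisons such as "higher than," which is what isHigherRisk demonstrates.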
The attribute usage type specifies whether an attribute is active (used as input to mining), inactive (excluded from mining), or supplementary (brought forward with the input values but not used explicitly for mining). In Table 7-2, the usage type column specifies the attributes customer id, name, and address as inactive because these attributes are identifiers or will not generalize to predict whether a customer is an attriter. All other attributes are active and are used as input for data mining. This example does not include supplementary attributes. However, consider a derived attribute, called ageCapitalGainRatio, computed as the capital gains divided by the square of age.
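The usage types and the derived attribute can be sketched as follows. The class AttributeUsageSketch and its members are hypothetical names used only for illustration, and the attribute list in main is an assumed subset rather than a reproduction of Table 7-2. The ratio follows the definition given above: capital gains divided by the square of age.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these names are not taken from a real mining API.
class AttributeUsageSketch {

    // The three usage types discussed in the text.
    enum UsageType { ACTIVE, INACTIVE, SUPPLEMENTARY }

    // Returns the names of the attributes that serve as mining input,
    // in the order in which they were registered.
    static List<String> activeAttributes(Map<String, UsageType> usage) {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, UsageType> entry : usage.entrySet()) {
            if (entry.getValue() == UsageType.ACTIVE) {
                active.add(entry.getKey());
            }
        }
        return active;
    }

    // Derived attribute: capital gains divided by the square of age.
    static double ageCapitalGainRatio(double capitalGains, int age) {
        if (age <= 0) {
            throw new IllegalArgumentException("age must be positive");
        }
        return capitalGains / ((double) age * age);
    }

    public static void main(String[] args) {
        // Assumed subset of attributes, for illustration only.
        Map<String, UsageType> usage = new LinkedHashMap<>();
        usage.put("customer id", UsageType.INACTIVE);
        usage.put("name", UsageType.INACTIVE);
        usage.put("address", UsageType.INACTIVE);
        usage.put("education", UsageType.ACTIVE);
        usage.put("capital gains", UsageType.ACTIVE);

        System.out.println(activeAttributes(usage));
        System.out.println(ageCapitalGainRatio(12500.94, 50));
    }
}
```

Only the attributes marked ACTIVE reach the model build in this sketch; the identifiers are filtered out before any mining takes place.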
From the user's perspective, if the derived attribute ageCapitalGainRatio appears in a model rule, it may be difficult to interpret its values in business terms. In such a case, the model can reference supplementary attributes, for example, age and capital gain. Although these supplementary attributes are not used directly in the model build, they can be presented in the model details, so that a rule can be understood through the corresponding values of age and capital gain.
In addition to the usual ETL¹ operations used for loading and transforming data, data mining can involve algorithm-specific data preparation. Such preparation includes transformations such as binning and normalization, as introduced in Section 3.2. One may choose to prepare data manually to leverage domain-specific knowledge or to fine-tune the data to improve results. The data preparation type indicates whether data has been manually prepared. In Table 7-2, the preparation column lists which attributes are already prepared for model building. For more details about data preparation, refer to [Pyle 1999].
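As a rough illustration of manual preparation, the sketch below applies min-max normalization and equal-width binning to a numerical value. These two variants are assumptions chosen for simplicity, and other normalization and binning schemes exist. The class DataPreparationSketch and its methods are hypothetical names, not part of any mining API.

```java
// Hypothetical sketch of two common preparation steps.
class DataPreparationSketch {

    // Min-max normalization: rescales a value from [min, max] to [0, 1].
    static double normalize(double value, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must exceed min");
        }
        return (value - min) / (max - min);
    }

    // Equal-width binning: maps a value to a bin index in [0, bins - 1].
    // Values outside [min, max] are clamped to the first or last bin.
    static int binIndex(double value, double min, double max, int bins) {
        if (bins <= 0) {
            throw new IllegalArgumentException("bins must be positive");
        }
        double fraction = normalize(value, min, max);
        int index = (int) Math.floor(fraction * bins);
        return Math.max(0, Math.min(bins - 1, index));
    }

    public static void main(String[] args) {
        // An age of 25 on an assumed range of 0 to 100, split into 4 bins.
        System.out.println(normalize(25.0, 0.0, 100.0));
        System.out.println(binIndex(25.0, 0.0, 100.0, 4));
    }
}
```

An attribute transformed this way before the build would be marked as already prepared, so that the mining engine does not bin or normalize it a second time.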
¹ Extraction, Transformation, and Loading (ETL) is the process of extracting data from operational or external data sources; transforming the data, which includes cleansing, aggregation, summarization, integration, and other transformations; and loading the data into a data mart or data warehouse.