values correspond to discrete, nominal categories. Ordinal attributes
also have discrete values, but the order of those values is significant. In
Table 7-2, the attribute type column specifies attributes such as city,
county, state, education, and marital status as categorical attributes. The
attribute capital gains is a numerical attribute as it has continuous data
values, such as $12,500.94. The attribute credit risk is an ordinal
attribute as it has high, medium, or low as ordered relative values.
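
The distinction between the three attribute types can be sketched in
Java as follows. This is an illustration only: the class AttributeTypes
and the enum AttributeKind are hypothetical names, not part of any data
mining API, and the low-to-high ranking of the credit risk values is an
assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AttributeTypes {
    /** Hypothetical enum mirroring the three attribute types described in the text. */
    public enum AttributeKind { CATEGORICAL, NUMERICAL, ORDINAL }

    /** Ordered values of the ordinal attribute "credit risk", lowest first (assumed direction). */
    public static final List<String> CREDIT_RISK_ORDER = List.of("low", "medium", "high");

    /** Attribute types for a few of the attributes named in the text. */
    public static Map<String, AttributeKind> exampleTypes() {
        Map<String, AttributeKind> types = new LinkedHashMap<>();
        types.put("city", AttributeKind.CATEGORICAL);
        types.put("marital status", AttributeKind.CATEGORICAL);
        types.put("capital gains", AttributeKind.NUMERICAL);
        types.put("credit risk", AttributeKind.ORDINAL);
        return types;
    }

    /** Compares two credit risk values by rank; meaningful only because the attribute is ordinal. */
    public static int compareCreditRisk(String a, String b) {
        int rankA = CREDIT_RISK_ORDER.indexOf(a);
        int rankB = CREDIT_RISK_ORDER.indexOf(b);
        if (rankA < 0 || rankB < 0) {
            throw new IllegalArgumentException("unknown credit risk value");
        }
        return Integer.compare(rankA, rankB);
    }

    public static void main(String[] args) {
        System.out.println(exampleTypes());
        // Ordering comparisons apply to ordinal values but not to categorical ones.
        System.out.println(compareCreditRisk("low", "high") < 0);
    }
}
```

A categorical attribute such as city supports only equality tests, whereas
the ordinal credit risk attribute also supports the rank comparison shown.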
The attribute usage type specifies whether an attribute is active
(used as input to mining), inactive (excluded from mining), or
supplementary (brought forward with the input values but not used
explicitly for mining). In Table 7-2, the usage type column specifies
the attributes customer id, name, and address as inactive because
these attributes are identifiers or will not generalize to predict whether
a customer is an attriter. All other attributes are active and used as
input for data mining. In this example, we have not included
supplementary attributes. However, consider a derived attribute called
ageCapitalGainRatio, computed as capital gains divided by the square of
age. From the user's perspective, if the derived attribute
ageCapitalGainRatio appears in a model rule, it may be difficult to
interpret the underlying values as they relate to the business. In such
a case, the model can reference supplementary attributes, for example,
age and capital gain. Although these supplementary attributes are not
directly used in the model build, they can be presented in model details
to facilitate rule understanding using the corresponding values of age
and capital gain.
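
The derived-attribute scenario above can be sketched in Java as follows.
This is a minimal illustration under stated assumptions: the class
DerivedAttributeExample and the enum UsageType are hypothetical names,
and the age of 40 is an invented sample value. Only the formula (capital
gains divided by the square of age) and the usage assignments come from
the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DerivedAttributeExample {
    /** Hypothetical enum mirroring the three usage types described in the text. */
    public enum UsageType { ACTIVE, INACTIVE, SUPPLEMENTARY }

    /** Derived attribute from the text: capital gains divided by the square of age. */
    public static double ageCapitalGainRatio(double capitalGains, int age) {
        if (age <= 0) {
            throw new IllegalArgumentException("age must be positive");
        }
        return capitalGains / ((double) age * age);
    }

    /** Usage assignments for the scenario in which the derived attribute is the model input. */
    public static Map<String, UsageType> exampleUsage() {
        Map<String, UsageType> usage = new LinkedHashMap<>();
        usage.put("customer id", UsageType.INACTIVE);          // identifier, excluded from mining
        usage.put("name", UsageType.INACTIVE);
        usage.put("address", UsageType.INACTIVE);
        usage.put("ageCapitalGainRatio", UsageType.ACTIVE);    // used as input to mining
        usage.put("age", UsageType.SUPPLEMENTARY);             // carried along for interpretation
        usage.put("capital gain", UsageType.SUPPLEMENTARY);
        return usage;
    }

    public static void main(String[] args) {
        double capitalGains = 12500.94; // sample value mentioned in the text
        int age = 40;                   // illustrative value, not from the source
        double ratio = ageCapitalGainRatio(capitalGains, age);
        // Show the derived value next to the supplementary values that explain it.
        System.out.printf("ageCapitalGainRatio=%.4f (age=%d, capital gain=%.2f)%n",
                ratio, age, capitalGains);
    }
}
```

Printing age and capital gain beside the ratio mirrors the role of
supplementary attributes: they do not influence the model, but they make
a rule that mentions ageCapitalGainRatio easier to read.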
In addition to the usual ETL¹ operations used for loading and
transforming data, data mining can involve algorithm-specific data
preparation. Such data preparation includes transformations such as
binning and normalization, as introduced in Section 3.2. One may
choose to prepare data manually to leverage domain-specific
knowledge or to fine-tune data to improve results. The data preparation
type is used to indicate whether data is manually prepared. In Table 7-2,
the preparation column lists which attributes are already prepared for
model building. For more details about data preparation, refer to
[Pyle 1999].
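
As a rough sketch of what manual preparation might look like, the
following Java fragment applies one common form of each transformation
named above: min-max normalization to the range [0, 1] and equal-width
binning. The class PreparationSketch and its method names are
hypothetical, and these particular variants are assumptions made for
illustration; the text does not say which normalization or binning
scheme a given algorithm expects.

```java
public class PreparationSketch {
    /** Min-max normalization: rescales values linearly so the minimum maps to 0 and the maximum to 1. */
    public static double[] minMaxNormalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has no spread, so every value maps to 0.
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    /** Equal-width binning: returns a bin index in 0..binCount-1 for a value within [min, max]. */
    public static int equalWidthBin(double value, double min, double max, int binCount) {
        if (binCount <= 0 || max <= min) {
            throw new IllegalArgumentException("need binCount > 0 and max > min");
        }
        int bin = (int) Math.floor((value - min) / (max - min) * binCount);
        // Clamp so that value == max (and any out-of-range value) lands in an edge bin.
        return Math.max(0, Math.min(binCount - 1, bin));
    }

    public static void main(String[] args) {
        double[] gains = {0.0, 5000.0, 12500.94}; // illustrative sample values
        for (double n : minMaxNormalize(gains)) {
            System.out.println(n);
        }
        System.out.println(equalWidthBin(5000.0, 0.0, 12500.94, 4));
    }
}
```

An attribute transformed this way ahead of time would be one that the
preparation column marks as already prepared.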
1. Extraction, Transformation, and Loading (ETL) is the process of extracting
data from operational or external data sources; transforming the data, which
includes cleansing, aggregation, summarization, integration, and other
transformations; and loading the data into a data mart or data