As noted before, JDM allows the user to specify, on a per-attribute
basis, whether an attribute has already been prepared. If the user
does not want the data mining engine (DME) to further manipulate an
attribute's data values, perhaps by binning or normalization, the
attribute is flagged as prepared. If the DME cannot work with the
data as presented (for example, a neural network that requires
normalized data is given values outside the valid range), the DME
may choose to throw an exception or produce a poor model.
Some DMEs may be able to accept data in a more “raw” form and
perform automated transformations within the DME. In this case, the
user may flag the data as unprepared and expect the DME to
preprocess the data. One benefit of allowing the DME to prepare the
data is that such DME-performed transformations are typically
embedded in the model. Consequently, when data is scored or model
details are examined, values are presented in terms of the original
data values.
Contrast this with an example of user-provided transformations: if a
user binned the attribute age into 5 bins labeled bin-1 through bin-5,
the model may contain rules that refer to those bins, not the original
values. This makes model details difficult to interpret directly.
Moreover, when scoring data, the user must explicitly bin age before
providing those values to the model. Note that identifying data as
unprepared does not mean that the user did not, or could not, prepare
the data in some way, perhaps by removing or replacing missing
values, or by computing new attributes.
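
As a rough illustration of the user-provided binning described above,
the following sketch bins age into five labeled bins in plain Java. It
does not use the JDM API. The class name AgeBinning, the method binAge,
and the bin boundaries are all hypothetical, chosen only for this
example.

```java
// Minimal sketch of user-side binning; not part of JDM.
// The class name, method name, and boundaries are illustrative assumptions.
class AgeBinning {
    // Hypothetical exclusive upper bounds for bin-1 through bin-4.
    // Any age at or above the last bound falls into bin-5.
    static final double[] UPPER_BOUNDS = {20, 35, 50, 65};

    // Maps a raw age to its bin label, "bin-1" through "bin-5".
    static String binAge(double age) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (age < UPPER_BOUNDS[i]) {
                return "bin-" + (i + 1);
            }
        }
        return "bin-" + (UPPER_BOUNDS.length + 1);
    }

    public static void main(String[] args) {
        // Build time: the user bins age before handing data to the DME,
        // so the model only ever sees the bin labels.
        double[] buildAges = {18, 27, 42, 58, 71};
        for (double age : buildAges) {
            System.out.println(age + " -> " + binAge(age));
        }

        // Scoring time: the same function must be applied to new data,
        // because the model's rules refer to bin labels, not raw ages.
        double newCustomerAge = 33;
        System.out.println("score input: " + binAge(newCustomerAge));
    }
}
```

Because the boundaries live in user code rather than in the model, the
same binAge logic has to be kept alongside the model and reapplied
whenever data is scored.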
What to Look for in Data
One of the reasons for performing data analysis is to understand the
degree to which data contains useful values, or is rife with errors and
inconsistencies.
Constants and Identifiers
A simple type of analysis involves locating attributes that are
constants or identifiers. If an attribute contains only null values
or a single repeated value, it is called a constant and contains no
information for the data mining model. For example, it may be
interesting to know that all customers are from the United States,
but a data mining algorithm will not find such an attribute useful.
On the other hand, an attribute that contains all distinct values,
forming a key, is called an identifier. It can be useful for
identifying a case, but should not be used as a predictor in the
mining process. For example, the attribute social
security number can be used to predict which customers will attrite