Data Mining Process - Java Data Mining: Strategy, Standard, and Practice

As noted before, JDM allows the user to specify, on a per-attribute basis, whether an attribute has already been prepared. If the user does not want the data mining engine (DME) to further manipulate an attribute's data values, perhaps by binning or normalization, the attribute is flagged as prepared. If the DME cannot work with the data as presented (perhaps a neural network requiring normalized data was presented with data in an invalid range), the DME may choose to throw an exception or produce a poor model.

Some DMEs may be able to accept data in a more "raw" form and perform automated transformations within the DME. In this case, the user may flag the data as unprepared and expect the DME to preprocess the data. One benefit of allowing the DME to prepare the data is that such DME-performed transformations are typically embedded in the model. Consequently, when data is scored or model details are examined, values are presented in terms of the original data values. Contrast this with an example of user-provided transformations: if a user binned the attribute age into 5 bins labeled bin-1 through bin-5, the model may contain rules that refer to those bins, not the original values. This makes directly interpreting model detail difficult. Moreover, when scoring data, the user must explicitly bin age before providing those values to the model. Note that identifying data as unprepared does not mean that the user did not, or could not, prepare the data in some way, perhaps by removing or replacing missing values, or by computing new attributes.
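The cost of user-provided binning can be illustrated with a minimal sketch in plain Java. It does not use the JDM API: the class EqualWidthBinner, the age range 0 to 100, and the equal-width scheme are all assumptions made for illustration; only the labels bin-1 through bin-5 come from the text. The point is that the same binner instance has to be applied to the build data and again to every case that is scored.

```java
/** Hypothetical helper (not part of JDM): equal-width binning of a numeric attribute. */
final class EqualWidthBinner {
    private final double min;
    private final double max;
    private final int binCount;

    EqualWidthBinner(double min, double max, int binCount) {
        if (binCount < 1 || !(max > min)) {
            throw new IllegalArgumentException("need binCount >= 1 and max > min");
        }
        this.min = min;
        this.max = max;
        this.binCount = binCount;
    }

    /** Maps a raw value to a label "bin-1" .. "bin-N"; out-of-range values are clamped. */
    String bin(double value) {
        double width = (max - min) / binCount;
        int index = (int) Math.floor((value - min) / width);
        index = Math.max(0, Math.min(binCount - 1, index));
        return "bin-" + (index + 1);
    }

    public static void main(String[] args) {
        // Assumed age range of 0..100 split into 5 bins of width 20.
        EqualWidthBinner ageBinner = new EqualWidthBinner(0, 100, 5);

        // Build time: the model only ever sees the bin labels, so its rules refer to them.
        System.out.println("build case age=37 -> " + ageBinner.bin(37));

        // Scoring time: the user must repeat exactly the same transformation.
        System.out.println("apply case age=64 -> " + ageBinner.bin(64));
    }
}
```

If the binning boundaries used at scoring time differ from those used at build time, the labels no longer mean the same thing to the model, which is one reason DME-embedded transformations are convenient.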

What to Look for in Data

One of the reasons for performing data analysis is to understand the degree to which data contains useful values, or is rife with errors and inconsistencies.

Constants and Identifiers

A simple type of analysis involves locating attributes that are constants or identifiers. An attribute that contains only null values, or the same value in every case, is called a constant and contains no information for the data mining model. For example, it may be interesting to know all customers are from the United States, but a data mining algorithm will not find such an attribute useful. On the other hand, an attribute that contains all distinct values, forming a key, is called an identifier. It can be useful to identify a case, but should not be used as a predictor in the mining process. For example, the attribute social security number can be used to predict which customers will attrite
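
A check for constants and identifiers can be sketched in a few lines of plain Java. This is an illustration only, not a JDM facility: the class AttributeProfiler, its Kind enum, and the choice to treat an attribute as an identifier only when it has no nulls and every value is distinct are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of JDM): flags constant and identifier attributes. */
final class AttributeProfiler {
    enum Kind { CONSTANT, IDENTIFIER, CANDIDATE }

    /** Classifies one attribute given its values, one entry per case (null = missing). */
    static Kind classify(List<String> values) {
        // HashSet permits null, so a missing value counts as one distinct "value".
        Set<String> distinct = new HashSet<>(values);

        // All nulls, or the same value in every case: no information for the model.
        if (distinct.size() <= 1) {
            return Kind.CONSTANT;
        }
        // Every case has its own non-null value: the attribute behaves like a key.
        if (!distinct.contains(null) && distinct.size() == values.size()) {
            return Kind.IDENTIFIER;
        }
        return Kind.CANDIDATE;
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList("US", "US", "US")));
        System.out.println(classify(Arrays.asList("111-11-1111", "222-22-2222", "333-33-3333")));
        System.out.println(classify(Arrays.asList("M", "F", "M")));
    }
}
```

On a small sample, an ordinary attribute can look like an identifier by chance, so a check like this is best treated as a prompt for review rather than an automatic filter.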
