Data Mining Process - Java Data Mining: Strategy, Standard, and Practice


perfectly for a given build dataset. However, such a model will not generalize to different social security numbers and therefore is not useful. Some attributes are near constants or near identifiers, meaning that a high percentage of the values are the same or unique, respectively. What constitutes high is often determined by the user or DME, but generally can be around 95 percent depending on the dataset. Both constants and identifiers should be excluded as predictors from a dataset prior to mining.
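
The screening rule described above can be sketched in Java. This is only an illustration, not part of any standard API: the class name `AttributeScreen`, its method names, and the representation of an attribute as a list of strings are all assumptions, and the threshold is left to the caller because the right cutoff depends on the dataset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for flagging near constants and near identifiers.
class AttributeScreen {

    /** Fraction of cases that share the single most frequent value. */
    static double mostFrequentFraction(List<String> values) {
        if (values.isEmpty()) return 0.0;
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String v : values) {
            int c = counts.merge(v, 1, Integer::sum);
            max = Math.max(max, c);
        }
        return (double) max / values.size();
    }

    /** Fraction of cases whose value occurs exactly once in the attribute. */
    static double uniqueFraction(List<String> values) {
        if (values.isEmpty()) return 0.0;
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        long singles = counts.values().stream().filter(c -> c == 1).count();
        return (double) singles / values.size();
    }

    /** True when at least `threshold` of the values are the same. */
    static boolean isNearConstant(List<String> values, double threshold) {
        return mostFrequentFraction(values) >= threshold;
    }

    /** True when at least `threshold` of the values are unique. */
    static boolean isNearIdentifier(List<String> values, double threshold) {
        return uniqueFraction(values) >= threshold;
    }
}
```

An attribute flagged by either check with a threshold such as 0.95 would be a candidate for exclusion as a predictor; the final decision still rests with the user.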

Missing Values

Missing values are common in real data. For any given record, or case, data may not have been provided, for example, a customer not specifying an income on a warranty card. Also, data may have been lost, for example, a temperature recording device that failed for a period of time would collect no data. If a case has too many missing values, it may not be worth including. Similarly, if an attribute has too many missing values, it too may be worth excluding from the build data.
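
As a rough sketch of this kind of screening, the Java below measures how much of a case or an attribute is missing and drops cases above a caller-chosen limit. The class `MissingValueScreen`, its method names, and the choice to represent the dataset as a `Double[][]` with `null` marking a missing value are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; rows are cases, columns are attributes, null means missing.
class MissingValueScreen {

    /** Fraction of attributes that are missing in one case. */
    static double missingFractionInCase(Double[] row) {
        if (row.length == 0) return 0.0;
        int missing = 0;
        for (Double v : row) {
            if (v == null) missing++;
        }
        return (double) missing / row.length;
    }

    /** Fraction of cases in which the attribute at `column` is missing. */
    static double missingFractionInAttribute(Double[][] rows, int column) {
        if (rows.length == 0) return 0.0;
        int missing = 0;
        for (Double[] row : rows) {
            if (row[column] == null) missing++;
        }
        return (double) missing / rows.length;
    }

    /** Keeps only cases whose missing fraction is at or below `maxMissingFraction`. */
    static List<Double[]> dropSparseCases(Double[][] rows, double maxMissingFraction) {
        List<Double[]> kept = new ArrayList<>();
        for (Double[] row : rows) {
            if (missingFractionInCase(row) <= maxMissingFraction) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

The same per-attribute fraction can be compared against a limit to decide whether an attribute should be left out of the build data.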

As with constants and identifiers, what constitutes too many missing values in a case or attribute is subject to experience or trial and error. In some cases, missing values can be replaced with a constant such as the average value for the attribute, or even a value predicted from another model, likely built using other predictors in the same dataset. Using a model to predict and populate missing values is called value imputation and may, of course, produce incorrect values for a given case and thus bias the dataset. However, it may still produce better results than leaving the values as missing. Experience, trial and error, and resulting model quality can guide the decision on how to treat missing values. If a target attribute in a supervised mining function has missing values in the build dataset, the corresponding cases should be removed since the model does not know the correct answer for these cases. Some data mining algorithms handle missing values automatically, requiring no user preprocessing.
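
Two of the treatments above, replacing missing values with the attribute average and removing cases whose target is missing, might look like the following in Java. This is a minimal sketch under assumptions: the class `MissingValueTreatment` and its methods are hypothetical, `null` stands for a missing value, and model-based value imputation is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; null marks a missing value.
class MissingValueTreatment {

    /** Returns a copy of the attribute with each null replaced by the mean of the non-missing values. */
    static double[] fillWithMean(Double[] column) {
        double sum = 0.0;
        int present = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                present++;
            }
        }
        if (present == 0) {
            throw new IllegalArgumentException("attribute has no non-missing values");
        }
        double mean = sum / present;
        double[] filled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            filled[i] = (column[i] != null) ? column[i] : mean;
        }
        return filled;
    }

    /** Keeps only cases in which the target attribute at `targetColumn` is present. */
    static List<Double[]> dropCasesWithMissingTarget(Double[][] rows, int targetColumn) {
        List<Double[]> kept = new ArrayList<>();
        for (Double[] row : rows) {
            if (row[targetColumn] != null) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Filling with the mean keeps every case but treats all missing entries identically, so comparing model quality with and without the replacement remains a sensible check.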

Errors and Outliers

Like missing values, data that contain errors are common in practice. Errors can result from data entry mistakes, such as mistyping the name of a town (“Bostin” instead of “Boston”), transposing digits in a customer's social security number, or specifying an invalid date (“13/32/06”). Errors can also be deliberate, as when customers misstate income, age, interests, or even gender. Data mining techniques can
