Java Reference
In-Depth Information
perfectly for a given build dataset. However, such a model will not
generalize to different social security numbers and therefore is not
useful. Some attributes are near constants or near identifiers, meaning
that a high percentage of the values are the same or unique, respectively.
What constitutes "high" is often determined by the user or DME,
but a threshold of around 95 percent is typical, depending on the dataset.
Both constants and identifiers should be excluded as predictors from
a dataset prior to mining.
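A minimal sketch of how such a screen might look, assuming each attribute is available as a list of string values. The class name AttributeScreen, its method names, and the use of the most-frequent-value share and the singleton share as the two measures are illustrative assumptions, not part of any data mining API; the 0.95 threshold simply mirrors the 95 percent figure mentioned above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: flags attributes that are near constants or near identifiers. */
final class AttributeScreen {

    /** Fraction of values equal to the most frequent value (1.0 for a true constant). */
    static double modeFraction(List<String> values) {
        if (values.isEmpty()) return 0.0;
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String v : values) {
            int c = counts.merge(v, 1, Integer::sum); // updated count for this value
            max = Math.max(max, c);
        }
        return (double) max / values.size();
    }

    /** Fraction of values that occur exactly once (1.0 for a true identifier). */
    static double uniqueFraction(List<String> values) {
        if (values.isEmpty()) return 0.0;
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        long singletons = counts.values().stream().filter(c -> c == 1).count();
        return (double) singletons / values.size();
    }

    /** True if the attribute looks like a near constant or a near identifier. */
    static boolean shouldExclude(List<String> values, double threshold) {
        return modeFraction(values) >= threshold || uniqueFraction(values) >= threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.95; // assumed cutoff; tune per dataset
        List<String> ssn = List.of("111-11-1111", "222-22-2222", "333-33-3333", "444-44-4444");
        List<String> country = List.of("US", "US", "US", "US");
        List<String> city = List.of("Boston", "Boston", "Austin", "Denver");

        System.out.println("ssn excluded:     " + shouldExclude(ssn, threshold));
        System.out.println("country excluded: " + shouldExclude(country, threshold));
        System.out.println("city excluded:    " + shouldExclude(city, threshold));
    }
}
```

In this toy data the social security number column is an identifier and the country column is a constant, so both would be dropped as predictors, while the city column would be kept.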
Missing Values
Missing values are common in real data. For any given record, or
case, data may not have been provided, as when a customer does not
specify an income on a warranty card. Data may also have been
lost, as when a temperature recording device fails for a
period of time and collects no data. If a case has too many missing
values, it may not be worth including. Similarly, if an attribute has
too many missing values, it too may be worth excluding from the
build data.
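The filtering described above can be sketched as follows, assuming the build data is held as a Double[][] in which null marks a missing value. The class MissingValueFilter, its method names, and the cutoffs used in main are hypothetical choices made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: drops cases and attributes with too many missing values. */
final class MissingValueFilter {

    /** Fraction of null entries in one case (row). */
    static double missingFractionInCase(Double[] row) {
        if (row.length == 0) return 0.0;
        int missing = 0;
        for (Double v : row) if (v == null) missing++;
        return (double) missing / row.length;
    }

    /** Fraction of cases in which the given attribute (column) is null. */
    static double missingFractionInAttribute(Double[][] data, int col) {
        if (data.length == 0) return 0.0;
        int missing = 0;
        for (Double[] row : data) if (row[col] == null) missing++;
        return (double) missing / data.length;
    }

    /** Keeps only the cases whose missing fraction does not exceed maxMissing. */
    static List<Double[]> keepUsableCases(Double[][] data, double maxMissing) {
        List<Double[]> kept = new ArrayList<>();
        for (Double[] row : data) {
            if (missingFractionInCase(row) <= maxMissing) kept.add(row);
        }
        return kept;
    }

    /** Returns the column indexes whose missing fraction does not exceed maxMissing. */
    static List<Integer> keepUsableAttributes(Double[][] data, int columns, double maxMissing) {
        List<Integer> kept = new ArrayList<>();
        for (int col = 0; col < columns; col++) {
            if (missingFractionInAttribute(data, col) <= maxMissing) kept.add(col);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Columns: income, age, responded (null = missing).
        Double[][] data = {
            {52000.0, 34.0, 1.0},
            {null,    null, 0.0},
            {61000.0, null, 1.0},
            {null,    45.0, 0.0},
        };
        // Cutoffs are assumptions; in practice they come from experience or trial and error.
        System.out.println("usable cases: " + keepUsableCases(data, 0.5).size());
        System.out.println("usable attributes: " + keepUsableAttributes(data, 3, 0.4));
    }
}
```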
Similar to constants and identifiers, what constitutes too many
missing values in a case or attribute is subject to experience or trial
and error. In some cases, missing values can be replaced with a
constant such as the average value for the attribute, or even a value
predicted from another model likely built using other predictors in the
same dataset. Using a model to predict and populate missing values
is called value imputation and may, of course, produce incorrect
values for a given case and thus bias the dataset. However, it may still
produce better results than leaving the values as missing.
Experience, trial and error, and resulting model quality can guide the
decision on how to treat missing values. If a target attribute in a
supervised mining function has missing values in the build dataset,
the corresponding cases should be removed since the model does
not know the correct answer for these cases. Some data mining
algorithms handle missing values automatically, requiring no user
preprocessing.
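As a rough illustration of two of the treatments above, the sketch below removes cases whose target is missing and then fills missing predictor values with the attribute average. It uses the simpler constant replacement, not model-based value imputation. The class MissingValueTreatment, the prepare method, and the Double[][] layout with null for missing values are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: drops cases with a missing target, then mean-fills predictors. */
final class MissingValueTreatment {

    /**
     * Returns a cleaned copy of the data; the input array is not modified.
     * Cases with a null target are removed. Each remaining null predictor is
     * replaced by the mean of the non-missing values in its column. A column
     * with no observed values at all is left as null.
     */
    static Double[][] prepare(Double[][] data, int targetCol) {
        List<Double[]> kept = new ArrayList<>();
        for (Double[] row : data) {
            if (row[targetCol] != null) kept.add(row.clone());
        }
        int width = kept.isEmpty() ? 0 : kept.get(0).length;
        for (int col = 0; col < width; col++) {
            if (col == targetCol) continue;
            double sum = 0.0;
            int observed = 0;
            for (Double[] row : kept) {
                if (row[col] != null) { sum += row[col]; observed++; }
            }
            if (observed == 0) continue; // nothing to average
            double mean = sum / observed;
            for (Double[] row : kept) {
                if (row[col] == null) row[col] = mean;
            }
        }
        return kept.toArray(new Double[0][]);
    }

    public static void main(String[] args) {
        // Columns: income, age, target (null = missing).
        Double[][] data = {
            {50000.0, 30.0, 1.0},
            {null,    40.0, 0.0},
            {70000.0, null, 1.0},
            {60000.0, 50.0, null}, // target unknown: dropped before building
        };
        for (Double[] row : prepare(data, 2)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The means here are computed after the cases with a missing target are removed, so the filled values reflect only the cases that will actually be used for the build. Whether that ordering is preferable is a judgment call for the dataset at hand.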
Errors and Outliers
Like missing values, data that contain errors are common in practice.
Errors can result from data entry mistakes, such as mistyping the
name of a town (“Bostin” instead of “Boston”), transposing digits in
a customer's social security number, or specifying an invalid date
(“13/32/06”). Errors can also be deliberate, as when customers misstate
income, age, interests, or even gender. Data mining techniques can