handled in a dataset before generating visualizations of the dataset or applying
data mining algorithms.
Missing values are typically handled in one of five different ways.
1. Eliminate any observations with missing data from the dataset. This is
usually an acceptable solution when few observations relative to the total
number of observations in the dataset contain missing values.
2. Keep the observations, but drop the column with missing values. This option
may be acceptable when most of the missing values are concentrated in a
single column and that particular column is not deemed critical to the
planned analysis.
3. Assign a default value such as the mean or modal value.
4. Use the values of other columns within the dataset to predict a value for the
one that is missing. For example, a worker's “age” may be used to estimate
the “years of experience” for the worker. The estimated value may not be
exact, but at least it allows the rest of the data belonging to the observation to
be used in the pending analysis. If the estimate is reasonable, it may not
significantly bias the results of an analysis.
5. Look in other sources for the missing values before mining the dataset using
a tool such as VisMiner. If the missing values are important, it may be worth
the effort to find other sources for the missing data that could be merged with
your dataset to form a more complete set.
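The first four options above can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not a description of how VisMiner or any other tool works. The sample records, the field names, and the helper function names are all hypothetical, and a missing value is assumed to be stored as None. The prediction step uses a simple least-squares line, which is only one of many ways to estimate a value from another column.

```python
from statistics import mean

# Hypothetical sample data; None marks a missing value.
workers = [
    {"name": "Ana", "age": 30, "years_experience": 8},
    {"name": "Ben", "age": 45, "years_experience": None},
    {"name": "Cara", "age": 50, "years_experience": 28},
    {"name": "Dev", "age": 25, "years_experience": 3},
]


def drop_incomplete_rows(rows):
    """Option 1: keep only observations that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_column(rows, column):
    """Option 2: keep every observation but remove one column."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]


def fill_with_default(rows, column, statistic=mean):
    """Option 3: replace missing values with a statistic of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    default = statistic(known)
    return [{**r, column: default if r[column] is None else r[column]}
            for r in rows]


def fill_by_regression(rows, target, predictor):
    """Option 4: estimate `target` from `predictor` with a least-squares line.

    Assumes at least two complete rows with distinct predictor values.
    Rows whose predictor is also missing are left unchanged.
    """
    complete = [(r[predictor], r[target]) for r in rows
                if r[predictor] is not None and r[target] is not None]
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in complete)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    filled = []
    for r in rows:
        if r[target] is None and r[predictor] is not None:
            r = {**r, target: intercept + slope * r[predictor]}
        filled.append(dict(r))
    return filled


if __name__ == "__main__":
    print(len(drop_incomplete_rows(workers)), "complete observations")
    print(fill_with_default(workers, "years_experience")[1])
    print(fill_by_regression(workers, "years_experience", "age")[1])
```

Each helper returns a new list and leaves the original records untouched, so the options can be compared side by side on the same data. Passing `statistics.mode` as the `statistic` argument of `fill_with_default` would assign the modal value instead of the mean.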
Missing values may occur for two reasons. Either the value is non-existent for
the given observation or the actual value is unknown. An example of a non-
existent item is the value of the field SpouseName for an unmarried person.
Unknown missing values occur for a number of reasons:
1. The entity providing the value was unable or unwilling to report a value.
2. The instrument capturing the data malfunctioned during data collection.
3. The data was not deemed important, and so was not collected, during part
of the data collection period.
Non-existent values are usually handled by assigning a default value. This is
especially true when the field type is nominal. In this case, a value of “none”
or “not applicable” could be assigned. In the case of numeric fields, assigning
a value of zero is appropriate only when it cannot be confused with data items
whose actual value is
zero. For example, if in a customer dataset, the field “Total Purchases in Last