Advanced Topics in Initial Exploration and Dataset Preparation Using VisMiner - Visual Data Mining: The VisMiner Approach


handled in a dataset before generating visualizations of the dataset or applying

data mining algorithms.

Missing values are typically handled in one of five different ways.

1. Eliminate any observations with missing data from the dataset. This is usually an acceptable solution when few observations relative to the total number of observations in the dataset contain missing values.

2. Keep the observations, but drop the column with missing values. This option may be acceptable when most of the missing values are concentrated in a single column and that particular column is not deemed critical to the planned analysis.

3. Assign a default value such as the mean or modal value.

4. Use the values of other columns within the dataset to predict a value for the one that is missing. For example, a worker's “age” may be used to estimate the “years of experience” for the worker. The estimated value may not be exact, but at least it allows the rest of the data belonging to the observation to be used in the pending analysis. If the estimate is reasonable, it may not significantly bias the results of an analysis.

5. Look in other sources for the missing values before mining the dataset using a tool such as VisMiner. If the missing values are important, it may be worth the effort to find other sources for the missing data that could be merged with your dataset to form a more complete set.
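The five approaches above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how VisMiner itself handles missing values: the worker records, the field names, the helper function names, and the use of None as the missing-value marker are all hypothetical, and the prediction step uses a simple least-squares line as one possible way to estimate a value from another column.

```python
from collections import Counter

# Hypothetical worker records; None marks a missing value (an assumption).
workers = [
    {"name": "A", "age": 30, "years_experience": 8, "dept": "sales"},
    {"name": "B", "age": 40, "years_experience": 18, "dept": None},
    {"name": "C", "age": 50, "years_experience": None, "dept": "sales"},
    {"name": "D", "age": 35, "years_experience": 13, "dept": "ops"},
]

def drop_incomplete_rows(rows):
    """Approach 1: keep only observations with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_column(rows, column):
    """Approach 2: keep every observation but remove one column."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]

def fill_default(rows, column):
    """Approach 3: fill with the mean (numeric) or the mode (nominal)."""
    known = [r[column] for r in rows if r[column] is not None]
    if not known:
        return [dict(r) for r in rows]  # nothing to base a default on
    if all(isinstance(v, (int, float)) for v in known):
        default = sum(known) / len(known)
    else:
        default = Counter(known).most_common(1)[0][0]
    return [{**r, column: default if r[column] is None else r[column]}
            for r in rows]

def fill_predicted(rows, target, predictor):
    """Approach 4: estimate target from predictor with a least-squares line.

    Assumes at least two complete pairs with differing predictor values.
    """
    pairs = [(r[predictor], r[target]) for r in rows
             if r[predictor] is not None and r[target] is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [{**r, target: intercept + slope * r[predictor]}
            if r[target] is None and r[predictor] is not None else r
            for r in rows]

def fill_from_source(rows, other, key, column):
    """Approach 5: look the missing value up in a second source by key."""
    lookup = {o[key]: o[column] for o in other if o.get(column) is not None}
    return [{**r, column: lookup.get(r[key])} if r[column] is None else r
            for r in rows]
```

For instance, `fill_predicted(workers, "years_experience", "age")` estimates the missing experience of worker C from the age-to-experience relationship seen in the other three records, while `drop_incomplete_rows(workers)` would discard workers B and C entirely. Which approach is preferable depends on how many values are missing and how important the affected column is to the analysis.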

Missing values may occur for two reasons. Either the value is non-existent for

the given observation or the actual value is unknown. An example of a non-existent

item is the value of the field SpouseName for an unmarried person.
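The distinction between the two cases can be made explicit in code. The following is a minimal sketch, assuming a hypothetical "married" field is available to tell the cases apart and that None marks a missing SpouseName; neither assumption comes from VisMiner.

```python
# Hypothetical records; None marks a missing SpouseName.
# "married" is an assumed helper field: True, False, or None if unrecorded.
people = [
    {"name": "Ann", "married": True, "SpouseName": "Bob"},
    {"name": "Cy", "married": False, "SpouseName": None},
    {"name": "Di", "married": True, "SpouseName": None},
]

def classify_spouse_name(person):
    """Return 'present', 'non-existent', or 'unknown' for SpouseName."""
    if person["SpouseName"] is not None:
        return "present"
    if person["married"] is False:
        return "non-existent"  # unmarried: no spouse name can exist
    return "unknown"           # a name may exist but was not captured

labels = {p["name"]: classify_spouse_name(p) for p in people}
```

Here Cy's missing SpouseName is non-existent, while Di's is unknown, so the two records may call for different handling.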

Unknown missing values occur for a number of reasons:

- The entity providing the value was unable or unwilling to report a value.

- The instrument capturing the data malfunctioned during data collection.

- The data was not deemed important during a limited span of the data collection period and thus was not collected over that span.

Handling missing data in the case of the non-existent value is usually

accomplished by assigning a default value. This is especially true when the

field type is nominal. In this case, a value of “none” or “not applicable” could

be assigned. In the case of numeric fields, assigning a value of zero is appropriate

when it does not conflict in interpretation with data items whose actual value is

zero. For example, if in a customer dataset, the field “Total Purchases in Last
