Introduction - Visual Data Mining: The VisMiner Approach


Attribute enumeration - Begin by browsing the list of attributes contained in the dataset and the corresponding types of each attribute. Understand what each attribute represents or measures and the units in which it is encoded. Look for identifier or key attributes - those that uniquely identify observations in the dataset.
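
As a rough illustration of attribute enumeration outside of VisMiner, the following Python sketch lists each attribute, guesses its type, and flags candidate key attributes. The dataset, the attribute names, and the helper functions `infer_type` and `enumerate_attributes` are hypothetical and exist only for this example; treating "every value parses as a number" as the test for a numeric type is a simplifying assumption.

```python
# Hypothetical in-memory dataset: each dict is one observation.
rows = [
    {"CustomerID": "C001", "CustomerType": "Retail", "AnnualSpend": "1200.50"},
    {"CustomerID": "C002", "CustomerType": "Wholesale", "AnnualSpend": "8450.00"},
    {"CustomerID": "C003", "CustomerType": "Retail", "AnnualSpend": "310.75"},
    {"CustomerID": "C004", "CustomerType": "Retail", "AnnualSpend": "1200.50"},
]


def infer_type(values):
    """Label an attribute 'numeric' if every value parses as a float, else 'nominal'."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "nominal"


def enumerate_attributes(rows):
    """Return {attribute: (inferred type, is_candidate_key)} for a list of dict rows."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        # A candidate key has one distinct value per observation.
        is_key = len(set(values)) == len(values)
        summary[name] = (infer_type(values), is_key)
    return summary


summary = enumerate_attributes(rows)
for name, (kind, is_key) in summary.items():
    print(f"{name}: {kind}{' (candidate key)' if is_key else ''}")
```

Uniqueness in a sample only suggests a key; whether an attribute really identifies observations still has to be confirmed from what the attribute represents.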

Attribute distributions - For numeric types, determine the range of values in the dataset, then look at the shape and the symmetry or skew of the distribution. Does it appear to approximate a normal distribution or some other distribution? For nominal (categorical) data, look at the number of unique values (categories) and the proportion of observations belonging to each category. For example, suppose that you have an attribute called CustomerType. The first thing that you want to determine is the number of different CustomerTypes in the dataset and the proportion of observations in each.
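
The checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VisMiner procedure: the sample values and the function names `numeric_summary` and `category_proportions` are made up, and skew is estimated with the simple moment-based formula (the mean cubed z-score).

```python
from collections import Counter
from statistics import mean, median, pstdev


def numeric_summary(values):
    """Range, centre, and a moment-based skewness estimate for a numeric attribute."""
    mu = mean(values)
    sigma = pstdev(values)
    # Population skewness: mean cubed z-score. Positive means a longer right tail.
    if sigma == 0:
        skew = 0.0
    else:
        skew = sum(((v - mu) / sigma) ** 3 for v in values) / len(values)
    return {"min": min(values), "max": max(values),
            "mean": mu, "median": median(values), "skew": skew}


def category_proportions(values):
    """Proportion of observations that fall in each category of a nominal attribute."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Hypothetical sample data for illustration only.
spend = [10, 12, 11, 13, 12, 11, 95]
customer_type = ["Retail", "Retail", "Wholesale", "Retail"]

stats = numeric_summary(spend)
props = category_proportions(customer_type)
print(stats)
print(props)
```

A mean well above the median, together with a positive skew value, points to a long right tail, which is a hint that the attribute does not approximate a normal distribution.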

Identification of sub-populations - Look for attribute distributions that are multimodal, that is, distributions that have multiple peaks. Such distributions indicate that the observations in the dataset are drawn from multiple sub-populations with potentially different distributions. When submitted in isolation to the data mining algorithms, these sub-populations could generate very different models from the model generated when the entire dataset is submitted. For example, in some situations the purchasing behavior of risk-taking individuals may be quite different from that of risk-averse individuals.
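
One crude way to screen for multiple peaks without a plot is to bin the values and count local maxima in the histogram. The sketch below is only an assumption-laden illustration: the functions `histogram` and `count_peaks`, the sample values, and the bin count are all hypothetical, and the result is sensitive to the number of bins and to noise, so it is no substitute for looking at the distribution.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def count_peaks(counts):
    """Count bins strictly higher than both neighbours (outside the range counts as 0)."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Hypothetical attribute whose values cluster around 3 and around 13.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 11, 12, 12, 13, 13, 13, 14, 14, 15]
peaks = count_peaks(histogram(values, bins=7))
print("peaks found:", peaks)
```

If more than one peak shows up consistently across several bin counts, it may be worth splitting the observations and examining each group separately.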

Pattern search - Look for potentially interesting and significant relationships (or patterns) between attributes. If your data mining objective is the generation of a prediction model, focus on relationships between your selected output attribute and attributes that may be considered for input. Note the type of the relationship - linear or non-linear, direct or inverse. Ask the question, “Does this relationship seem reasonable?” Also look at relationships between potential input attributes. If they are highly correlated, then you probably want to eliminate all but one as you conduct in-depth analyses.
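
The last point, keeping only one of several highly correlated inputs, can be sketched with the Pearson correlation coefficient. This is a minimal example under stated assumptions: the attribute names, the sample values, the helper `drop_redundant`, and the 0.9 threshold are all hypothetical, and Pearson correlation only captures linear relationships, so non-linear dependence between inputs would go unnoticed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_redundant(inputs, threshold=0.9):
    """Keep an input only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in inputs.items():
        if all(abs(pearson(values, inputs[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate inputs; height_in is height_cm in different units.
inputs = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 52, 29],
}
kept = drop_redundant(inputs)
print(kept)
```

The sign of the coefficient distinguishes a direct relationship (positive) from an inverse one (negative). Which attribute of a correlated group to keep is a judgment call; this sketch simply keeps the first one it encounters.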

Dataset preparation

The objective of dataset preparation is to transform the dataset into a form that can be submitted to a data mining algorithm for analysis. Tasks include:

Observation reduction - Frequently there is no need to analyze the full dataset when a subset is sufficient. There are three reasons to reduce the observation count in a dataset.

Processing the full dataset may take too long or be too computationally intensive. An organization's actual production database
