Attribute enumeration - Begin by browsing the list of attributes contained
in the dataset and the corresponding types of each attribute. Understand
what each attribute represents or measures and the units in which it is
encoded. Look for identifier or key attributes - those that uniquely identify
observations in the dataset.
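
As a minimal sketch of attribute enumeration, the following lists the types observed for each attribute and flags attributes whose values are unique across all observations. The toy dataset, its attribute names (CustomerID, AnnualSpend) and the helper functions are hypothetical and serve only as illustration. Uniqueness in a sample suggests a key attribute but does not prove it.

```python
# Hypothetical toy dataset: each observation is a dict of attribute -> value.
rows = [
    {"CustomerID": 101, "CustomerType": "Retail", "AnnualSpend": 1200.0},
    {"CustomerID": 102, "CustomerType": "Wholesale", "AnnualSpend": 8800.5},
    {"CustomerID": 103, "CustomerType": "Retail", "AnnualSpend": 1200.0},
]


def enumerate_attributes(rows):
    """Return {attribute name: set of Python type names observed}."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set()).add(type(value).__name__)
    return types


def candidate_keys(rows):
    """Return attributes that are never missing and never repeated."""
    keys = []
    for name in rows[0].keys():
        values = [row.get(name) for row in rows]
        if None not in values and len(set(values)) == len(values):
            keys.append(name)
    return keys


attribute_types = enumerate_attributes(rows)
key_attributes = candidate_keys(rows)

for name, type_names in attribute_types.items():
    print(name, sorted(type_names))
print("Candidate keys:", key_attributes)
```

An attribute that reports more than one type is worth a closer look, since mixed types often point to encoding problems.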
Attribute distributions - For numeric types, determine the range of values
in the dataset, then look at the shape and symmetry or skew of the
distribution. Does it appear to approximate a normal distribution or some
other distribution? For nominal (categorical) data, look at the number of
unique values (categories) and the proportion of observations belonging to
each category. For example, suppose that you have an attribute called
CustomerType. The first thing that you want to determine is the number
of different CustomerTypes in the dataset and the proportions of each.
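
The distribution checks described above can be sketched with the Python standard library. The sample values and function names below are assumptions made for illustration. The skewness here is the simple population moment coefficient (mean of cubed z-scores), which is one of several common definitions.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Range, center, and skewness of a numeric attribute."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # Population skewness: mean of cubed z-scores (0.0 if all values are equal).
    skew = sum(((v - mean) / sd) ** 3 for v in values) / len(values) if sd else 0.0
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "skewness": skew,
    }


def category_proportions(values):
    """Proportion of observations belonging to each category."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


# Hypothetical sample data.
spend = [10.0, 12.0, 11.0, 13.0, 90.0]
customer_types = ["Retail", "Retail", "Wholesale", "Retail"]

summary = numeric_summary(spend)
proportions = category_proportions(customer_types)
print(summary)
print(proportions)
```

A positive skewness together with a mean well above the median, as in this sample, indicates a long right tail rather than an approximately normal shape.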
Identification of sub-populations - Look for attribute distributions that are
multimodal, that is, distributions that have multiple peaks. Such
distributions suggest that the observations in the dataset are drawn
from multiple sub-populations with potentially different distributions.
These sub-populations could generate very different models when submitted
in isolation to the data mining algorithms than the model generated from
the entire dataset. For example, in
some situations the purchasing behavior of risk-taking individuals may be
quite different from that of risk-averse individuals.
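
One crude way to screen an attribute for multiple peaks is to bin its values and count local maxima in the histogram. The sketch below assumes equal-width bins and a hypothetical `min_share` threshold that ignores very small bins. Results depend heavily on the number of bins, so treat this as a first-pass heuristic, not a statistical test for multimodality.

```python
def histogram(values, bins):
    """Count values falling into equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put the maximum in the last bin
        counts[idx] += 1
    return counts


def count_peaks(counts, min_share=0.1):
    """Count bins that rise above the left neighbor and do not fall below the right one."""
    total = sum(counts)
    padded = [0] + counts + [0]
    peaks = 0
    for i in range(1, len(padded) - 1):
        big_enough = padded[i] >= min_share * total
        if big_enough and padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]:
            peaks += 1
    return peaks


# Hypothetical attribute values: two clusters versus one cluster.
two_groups = [20, 21, 22, 22, 23, 24, 60, 61, 62, 62, 63, 64]
one_group = [40, 41, 42, 42, 43, 44, 42, 41, 43]

bimodal_peaks = count_peaks(histogram(two_groups, bins=5))
unimodal_peaks = count_peaks(histogram(one_group, bins=5))
print(bimodal_peaks, unimodal_peaks)
```

When more than one peak appears, a natural next step is to split the observations by a plausible grouping attribute and examine each group's distribution separately.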
Pattern search - Look for potentially interesting and significant
relationships (or patterns) between attributes. If your data mining objective
is the generation of a prediction model, focus on relationships between your
selected output attribute and attributes that may be considered for input.
Note the type of each relationship: linear or non-linear, direct or inverse.
Ask the question, “Does this relationship seem reasonable?” Also look at
relationships between potential input attributes. If several are highly
correlated, you probably want to eliminate all but one of them as you
conduct in-depth analyses.
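
The correlation screening described above can be sketched as follows. The attribute names, sample values, and the 0.9 threshold are assumptions chosen for illustration. Pearson correlation measures only linear association, so a non-linear relationship can yield a low coefficient and still be important.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither attribute is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def drop_correlated(inputs, threshold=0.9):
    """Keep an input only if it is not highly correlated with one already kept."""
    kept = []
    for name in inputs:
        if all(abs(pearson(inputs[name], inputs[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate inputs: height_in duplicates height_cm in other units.
inputs = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [33.0, 21.0, 48.0, 27.0, 39.0],
}
# Hypothetical output attribute.
output = [50.0, 58.0, 67.0, 75.0, 84.0]

kept_inputs = drop_correlated(inputs)
output_correlation = pearson(inputs["height_cm"], output)
print(kept_inputs)
print(round(output_correlation, 3))
```

Which member of a correlated group to keep is a judgment call. This sketch simply keeps the first one encountered, whereas in practice you might prefer the attribute that is cheapest to collect or easiest to interpret.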
Dataset preparation
The objective of dataset preparation is to transform the dataset into a
form that can be submitted to a data mining algorithm for
analysis. Tasks include:
Observation reduction - Frequently there is no need to analyze the full
dataset when a subset is sufficient. There are three reasons to reduce the
observation count in a dataset.
Processing the full dataset may require too much time or
computation. An organization's actual production database
 