may have millions of observations (transactions). Mining of the entire
dataset may be too time-consuming for processing using some of the
available algorithms.
The dataset may contain sub-populations which are better mined inde-
pendently. At times, patterns emerge in sub-populations that don't exist
in the dataset as a whole.
The level of detail (granularity) of the data may be more than is necessary
for the planned analysis. For example, a sales dataset may have information
on each individual sale made by an enterprise. However, for mining
purposes, sales information summarized at the customer level or at a
geographic level, such as zip code, may be all that is necessary.
Observation reduction can be accomplished in three ways:
extraction of sub-populations
sampling
observation aggregation.
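The three approaches can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sales records, the field names (customer, zip, amount), the sample size, and the helper aggregate are all hypothetical and chosen only to mirror the sales example above.

```python
import random
from collections import defaultdict

# Hypothetical sales records: one row per individual sale.
sales = [
    {"customer": "C1", "zip": "30301", "amount": 20.0},
    {"customer": "C1", "zip": "30301", "amount": 35.0},
    {"customer": "C2", "zip": "30301", "amount": 10.0},
    {"customer": "C3", "zip": "98101", "amount": 50.0},
    {"customer": "C3", "zip": "98101", "amount": 5.0},
    {"customer": "C4", "zip": "98101", "amount": 15.0},
]

# 1. Extraction of a sub-population: keep only the sales from one zip code.
subpopulation = [row for row in sales if row["zip"] == "98101"]

# 2. Sampling: draw a simple random sample without replacement.
rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = rng.sample(sales, k=3)

# 3. Observation aggregation: total the sale amounts at a coarser level.
def aggregate(rows, key):
    """Sum the 'amount' field over all rows that share the same key value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

by_customer = aggregate(sales, "customer")  # one observation per customer
by_zip = aggregate(sales, "zip")            # one observation per zip code
```

Aggregating to the customer level turns six observations into four, and aggregating to the zip code level turns them into two, at the cost of losing the detail of individual sales.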
Dimension reduction - As dictated by the “curse of dimensionality”, data
becomes more sparse or spread out as the number of dimensions in a dataset
increases. This leads to a need for larger and larger sample sizes to adequately
fill the data space as the number of dimensions (attributes) increases. In
general, when applying a data mining algorithm to a dataset, the fewer the
dimensions, the more likely the results are to be statistically valid. However, it
is not advisable to eliminate attributes that may contribute to good model
predictions or explanations. There is a trade-off that must be balanced.
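A rough way to see the sparsity effect is to divide each dimension into a fixed number of bins and count the resulting cells. The sketch below assumes 10 bins per dimension and 1,000 observations; both numbers, and the function names cells and max_occupied_fraction, are illustrative choices rather than anything taken from the text.

```python
def cells(bins_per_dim, dims):
    """Number of cells when each of `dims` dimensions is split into bins."""
    return bins_per_dim ** dims

def max_occupied_fraction(n_obs, bins_per_dim, dims):
    """Upper bound on the share of cells that n_obs observations can occupy.

    Each observation falls in exactly one cell, so at most n_obs cells
    can be non-empty.
    """
    return min(1.0, n_obs / cells(bins_per_dim, dims))

# With a fixed number of observations, coverage collapses as dimensions grow.
for d in (1, 2, 3, 6):
    print(d, cells(10, d), max_occupied_fraction(1000, 10, d))
```

Under these assumptions, 1,000 observations could in principle touch every cell in three dimensions, but no more than one cell in a thousand in six dimensions.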
To reduce the dimensionality of a dataset, you may selectively remove
attributes or arithmetically combine attributes.
Attributes should be removed if they are not likely to be relevant to an
intended analysis or if they are redundant. An example of an irrelevant
attribute would be an observation identifier or key field. One would not
expect a customer number, for example, to contribute anything to the
understanding of a customer's purchase behavior. An example of a redundant
attribute would be a measure that is recorded in multiple units. For
example, a person's weight may be recorded in both pounds and kilograms;
only one of the two is needed.
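As a minimal sketch of attribute removal, the example below drops a key field and one of two weight columns. The records, the column names, and the 0.999 correlation threshold are assumptions made for illustration; a near-perfect correlation is used here only as one possible signal of redundancy, and in practice the decision also relies on knowing what the attributes mean.

```python
import math

# Hypothetical records: customer_id is a key field, and weight is stored
# in two units (kilogram values are rounded conversions of the pound values).
rows = [
    {"customer_id": 101, "weight_lb": 150.0, "weight_kg": 68.04, "purchases": 3},
    {"customer_id": 102, "weight_lb": 180.0, "weight_kg": 81.65, "purchases": 7},
    {"customer_id": 103, "weight_lb": 120.0, "weight_kg": 54.43, "purchases": 2},
    {"customer_id": 104, "weight_lb": 200.0, "weight_kg": 90.72, "purchases": 4},
]

def column(records, name):
    """Extract one attribute as a list of values."""
    return [record[name] for record in records]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Irrelevant attribute: the identifier carries no behavioral information.
to_drop = {"customer_id"}

# Redundant attribute: the two weight columns are almost perfectly correlated,
# so one of them is dropped (threshold is an assumed, illustrative value).
r = pearson(column(rows, "weight_lb"), column(rows, "weight_kg"))
if abs(r) > 0.999:
    to_drop.add("weight_kg")

reduced = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
```

After this step each record keeps only weight_lb and purchases, reducing four attributes to two.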
You may also arithmetically combine attributes with a formula. For
example, in a “homes for sale” dataset containing price and area (square
feet) attributes, you might derive a new attribute “price per square foot” by
dividing price by area, and then eliminate the price and area attributes.
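The price-per-square-foot derivation might look like the following sketch. The listings, the field names (price, area_sqft, price_per_sqft), and the helper combine_price_area are hypothetical, and the code assumes every area is greater than zero.

```python
# Hypothetical "homes for sale" records.
homes = [
    {"address": "A", "price": 300000.0, "area_sqft": 1500.0},
    {"address": "B", "price": 450000.0, "area_sqft": 2000.0},
]

def combine_price_area(home):
    """Replace price and area with a single derived attribute.

    Assumes area_sqft is greater than zero.
    """
    # Copy every attribute except the two that are being combined.
    derived = {k: v for k, v in home.items() if k not in ("price", "area_sqft")}
    derived["price_per_sqft"] = home["price"] / home["area_sqft"]
    return derived

homes_reduced = [combine_price_area(home) for home in homes]
```

Two attributes become one, so the dataset loses a dimension. The original price and area cannot be recovered from the ratio alone, which is part of the trade-off noted earlier.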
A related methodology for combining attributes to reduce the number
of dimensions is principal component analysis. It is a mathematical