Introduction - Visual Data Mining: The VisMiner Approach

may have millions of observations (transactions). Mining of the entire

dataset may be too time-consuming for processing using some of the

available algorithms.

The dataset may contain sub-populations which are better mined

independently. At times, patterns emerge in sub-populations that don't exist

in the dataset as a whole.

The level of detail (granularity) of the data may be more than is necessary

for the planned analysis. For example, a sales dataset may have information

on each individual sale made by an enterprise. However, for mining

purposes, sales information summarized at the customer level or at a

geographic level, such as zip code, may be all that is necessary.

Observation reduction can be accomplished in three ways:

extraction of sub-populations

sampling

observation aggregation.
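
The three approaches listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the VisMiner workflow; the `sales` records, their field layout, and the zip codes are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical transaction records: (customer_id, zip_code, amount).
sales = [
    ("C1", "84601", 20.0),
    ("C1", "84601", 35.0),
    ("C2", "84601", 12.5),
    ("C3", "10001", 80.0),
    ("C3", "10001", 5.0),
    ("C4", "10001", 42.0),
]

# 1. Extraction of a sub-population: keep only the sales from one zip code.
subset = [row for row in sales if row[1] == "84601"]

# 2. Sampling: draw a fixed-size random sample without replacement.
rng = random.Random(0)  # seeded only so the sketch is repeatable
sample = rng.sample(sales, k=3)

# 3. Observation aggregation: one summary row per customer
#    instead of one row per individual sale.
totals = defaultdict(float)
for customer_id, _zip_code, amount in sales:
    totals[customer_id] += amount
by_customer = dict(totals)

print(len(subset), len(sample), by_customer)
```

Each approach leaves fewer rows than the original six: extraction keeps one zip code, sampling keeps a random half, and aggregation collapses the sales into one row per customer.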

Dimension reduction - As dictated by the “curse of dimensionality”, data

becomes more sparse or spread out as the number of dimensions in a dataset

increases. This leads to a need for larger and larger sample sizes to adequately

fill the data space as the number of dimensions (attributes) increases. In

general, when applying a data mining algorithm to a dataset, the fewer the

dimensions, the more likely the results are to be statistically valid. However, it

is not advisable to eliminate attributes that may contribute to good model

predictions or explanations. There is a trade-off that must be balanced.
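
One rough way to see why sample sizes must grow with the number of dimensions is to divide each attribute's range into a fixed number of bins and count the resulting cells. The sketch below assumes 10 bins per attribute and one million observations; both figures and the function names are chosen only for illustration.

```python
def cells(bins_per_dimension: int, dimensions: int) -> int:
    """Number of cells when every dimension is split into equal bins."""
    return bins_per_dimension ** dimensions


def average_points_per_cell(observations: int, bins_per_dimension: int,
                            dimensions: int) -> float:
    """Average number of observations available to fill each cell."""
    return observations / cells(bins_per_dimension, dimensions)


# With a fixed number of observations, each added dimension multiplies the
# number of cells by 10, so the data in each cell thins out quickly.
for d in (1, 2, 3, 6):
    print(d, cells(10, d), average_points_per_cell(1_000_000, 10, d))
```

Under these assumptions, a million observations average 100,000 per cell in one dimension but only one per cell in six dimensions.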

To reduce the dimensionality of a dataset, you may selectively remove

attributes or arithmetically combine attributes.

Attributes should be removed if they are not likely to be relevant to an

intended analysis or if they are redundant. An example of an irrelevant

attribute would be an observation identifier or key field. One would not

expect a customer number, for example, to contribute anything to the

understanding of a customer's purchase behavior. An example of a redundant

attribute would be a measure that is recorded in multiple units. For

example, a person's weight may be recorded in both pounds and kilograms; only

one of the two is needed.
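
As a hedged sketch of attribute removal, the code below drops a known key field by name and then drops the second attribute of any pair whose values are almost perfectly correlated. The column names, the sample values, and the 0.999 threshold are assumptions made for this illustration; a near-perfect correlation is only a hint of redundancy and should be confirmed against what the attributes actually mean.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dataset stored column by column.
pounds = [150.0, 180.0, 200.0, 165.0]
data = {
    "customer_number": [1001, 1002, 1003, 1004],   # key field, irrelevant
    "weight_lb": pounds,
    "weight_kg": [w * 0.45359237 for w in pounds],  # same measure, other unit
    "purchases": [3, 7, 2, 5],
}

# Step 1: remove attributes known to be irrelevant (identifiers, keys).
key_fields = {"customer_number"}
reduced = {name: col for name, col in data.items() if name not in key_fields}

# Step 2: for each remaining pair, drop the later attribute when the two
# are (almost) perfectly correlated.
names = list(reduced)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        if first in reduced and second in reduced:
            if abs(pearson(reduced[first], reduced[second])) > 0.999:
                del reduced[second]

print(sorted(reduced))
```

Here the kilogram column is removed because it is an exact rescaling of the pound column, while the purchase counts are kept.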

You may also arithmetically combine attributes with a formula. For

example, in a “homes for sale” dataset containing price and area (square

feet) attributes, you might derive a new attribute “price per square foot” by

dividing price by area, then eliminating the price and area attributes.
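
The “price per square foot” example can be written out as follows. The records, the field names, and the helper function `combine_price_and_area` are hypothetical and serve only to illustrate the idea of replacing two attributes with one derived attribute.

```python
# Hypothetical "homes for sale" records.
homes = [
    {"address": "A", "price": 300000.0, "area_sqft": 1500.0},
    {"address": "B", "price": 450000.0, "area_sqft": 2000.0},
]


def combine_price_and_area(home):
    """Return a copy of the record with price and area replaced by their ratio."""
    derived = {key: value for key, value in home.items()
               if key not in ("price", "area_sqft")}
    derived["price_per_sqft"] = home["price"] / home["area_sqft"]
    return derived


reduced_homes = [combine_price_and_area(home) for home in homes]
print(reduced_homes)
```

Two attributes become one, at the cost of no longer being able to recover the original price or area from the reduced record.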

A related methodology for combining attributes to reduce the number

of dimensions is principal component analysis. It is a mathematical
