6.3.2 Data Squashing
A data squashing method seeks to compress, or “squash”, the data in such a way that
a statistical analysis carried out on the compressed data yields the same outcome
as one carried out on the original data set; that is, the statistical information is
preserved.
The first data squashing approach was proposed in [6] and termed DS, as a
solution for constructing a reduced data set. The DS approach to squashing is model-free
and relies on moment matching. The squashed data set consists of a set of artificial
data points chosen to replicate the moments of the original data within subsets of the
actual data. The authors of DS study various approaches to partitioning the data and
ordering the moments, and also provide a theoretical justification of their method by
considering a Taylor series expansion of an arbitrary likelihood function. Since the
method relies on the moments of the data, it should work well for any application in
which the likelihood is well approximated by the first few terms of a Taylor series.
In practice, it has only been demonstrated for logistic regression.
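A minimal sketch of the moment-matching idea follows (an illustration only, not the exact construction of [6]; the quantile-based partition on a single feature and the two-point replacement per cell are simplifying assumptions):

import numpy as np

def squash_by_moments(X, n_bins=4):
    # Partition X along quantiles of its first feature, then replace each
    # cell by two weighted pseudo-points that reproduce the cell's mean
    # exactly and its per-feature variance (covariances are not matched).
    edges = np.quantile(X[:, 0], np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(X[:, 0], edges[1:-1])       # cell index 0..n_bins-1
    pseudo, weights = [], []
    for b in range(n_bins):
        cell = X[bins == b]
        if len(cell) == 0:
            continue
        mu, sd = cell.mean(axis=0), cell.std(axis=0)
        pseudo.extend([mu - sd, mu + sd])          # symmetric about the mean
        weights.extend([len(cell) / 2.0] * 2)      # weights sum to |cell|
    return np.array(pseudo), np.array(weights)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
Z, w = squash_by_moments(X)                        # 8 weighted rows stand in for 10,000
print(Z.shape, w.sum())

The squashed set retains the weighted first and second marginal moments of each cell, which is the property a downstream analysis is expected to exploit.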
In [20], the authors proposed “likelihood-based data squashing” (LDS). LDS
is similar to DS in that it first partitions the data set and then constructs artificial data
points for each subset of the partition. However, the algorithms differ in how they
build the partition and how they construct the artificial data points. The DS algorithm
partitions the data along certain marginal quartiles and then matches moments,
whereas the LDS algorithm partitions the data using a likelihood-based clustering
and then selects artificial data points so as to mimic the target sampling or posterior
distribution. Both algorithms yield artificial data points with associated weights, so
the use of squashed data requires learning algorithms that can handle these weights.
LDS is slightly more general than DS in that it also supports ANN-based learning.
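Because both methods output weighted pseudo-points, the learner that consumes them must accept instance weights. The following sketch is illustrative only (it uses a plain random subsample as a stand-in for a genuinely squashed set) and shows how such weights can be passed to scikit-learn's LogisticRegression through the sample_weight argument:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_full = rng.normal(size=(50_000, 5))
y_full = (X_full @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) > 0).astype(int)

# Placeholder for the squashed set: in DS or LDS, Z and w would come from
# the squashing step; a random subsample is used here purely to show the
# weighted fit.
idx = rng.choice(len(X_full), size=500, replace=False)
Z, y_z = X_full[idx], y_full[idx]
w = np.full(len(Z), len(X_full) / len(Z))          # weights sum to 50,000

model = LogisticRegression(max_iter=1000)
model.fit(Z, y_z, sample_weight=w)                 # weights enter the log-likelihood
print(model.coef_)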
A subsequent approach, described in [23], presents a form of data squashing based
on empirical likelihood. This method re-weights a random sample of the data so that
certain expected values match those of the population. The benefits of this method are
a reduction of the optimization cost in terms of computational complexity and its
ability to enhance the performance of boosted random trees.
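A hedged sketch of such a re-weighting step follows (the formulation and names below are illustrative assumptions, not the procedure of [23]): empirical-likelihood style weights are obtained by solving for a Lagrange multiplier so that the weighted sample mean matches a known population mean.

import numpy as np
from scipy.optimize import root

def el_weights(x, mu):
    # Weights w_i = 1 / (n * (1 + lam @ (x_i - mu))) chosen so that the
    # weighted mean of the sample equals the population target mu.
    d = x - mu
    n = len(x)

    def estimating_eq(lam):
        return (d / (1.0 + d @ lam)[:, None]).mean(axis=0)

    lam = root(estimating_eq, x0=np.zeros(d.shape[1])).x
    w = 1.0 / (n * (1.0 + d @ lam))
    return w / w.sum()                        # renormalise against numerical drift

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.3, size=(200, 2))   # a mildly biased sample
w = el_weights(sample, mu=np.zeros(2))        # population mean assumed known
print((w[:, None] * sample).sum(axis=0))      # close to [0, 0]

Once computed, the weights can be passed to a weighted learner in the same way as the DS/LDS weights above.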
6.3.3 Data Clustering
Clustering algorithms partition the data examples into groups, or clusters, so that data
samples within a cluster are “similar” to one another and dissimilar to data examples
that belong to other clusters. Similarity is usually defined in terms of how close the
examples are in space, according to a distance function. The quality of a cluster
can be measured as a function of its diameter, which is the maximum
distance between any two samples belonging to the cluster. The average distance
of each object within the cluster to the centroid is an alternative measure of cluster
quality.
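Both measures are straightforward to compute for a given cluster; the helper functions below are an illustrative sketch assuming Euclidean distance:

import numpy as np
from scipy.spatial.distance import pdist

def cluster_diameter(points):
    # Maximum pairwise distance between members of the cluster.
    return pdist(points).max()

def mean_distance_to_centroid(points):
    # Average distance from each member to the cluster centroid.
    centroid = points.mean(axis=0)
    return np.linalg.norm(points - centroid, axis=1).mean()

rng = np.random.default_rng(3)
cluster = rng.normal(size=(100, 2))
print(cluster_diameter(cluster), mean_distance_to_centroid(cluster))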