6.3.2 Data Squashing
A data squashing method seeks to compress, or “squash”, the data in such a way that
a statistical analysis carried out on the compressed data yields the same outcome
as one carried out on the original data set; that is, the statistical information is
preserved.
The first data squashing approach was proposed in [6] and termed DS, as a
solution for constructing a reduced data set. The DS approach to squashing is model-free
and relies on moment matching. The squashed data set consists of a set of artificial
data points chosen to replicate the moments of the original data within subsets of the
actual data. The authors of DS study various approaches to partitioning the data and
ordering the moments, and also provide a theoretical justification of their method by
considering a Taylor series expansion of an arbitrary likelihood function. Since the
method relies on the moments of the data, it should work well for any application in
which the likelihood is well approximated by the first few terms of a Taylor series.
In practice, it has only been demonstrated for logistic regression.
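A minimal sketch of the moment-matching idea follows (an illustration only, not the exact construction of [6]; the quantile-based partition on a single feature and the two-point replacement per cell are simplifying assumptions):

import numpy as np

def squash_by_moments(X, n_bins=4):
    # Partition X along quantiles of its first feature, then replace each
    # cell by two weighted pseudo-points that reproduce the cell's mean
    # exactly and its per-feature variance (covariances are not matched).
    edges = np.quantile(X[:, 0], np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(X[:, 0], edges[1:-1])       # cell index 0..n_bins-1
    pseudo, weights = [], []
    for b in range(n_bins):
        cell = X[bins == b]
        if len(cell) == 0:
            continue
        mu, sd = cell.mean(axis=0), cell.std(axis=0)
        pseudo.extend([mu - sd, mu + sd])          # symmetric about the mean
        weights.extend([len(cell) / 2.0] * 2)      # weights sum to |cell|
    return np.array(pseudo), np.array(weights)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
Z, w = squash_by_moments(X)                        # 8 weighted rows stand in for 10,000
print(Z.shape, w.sum())

The squashed set retains the weighted first and second marginal moments of each cell, which is the property a downstream analysis is expected to exploit.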
In [20], the authors proposed “likelihood-based data squashing” (LDS). LDS
is similar to DS in that it first partitions the data set and then constructs artificial data
points for each subset of the partition. However, the algorithms differ in how they
build the partition and how they construct the artificial data points. The DS algorithm
partitions the data along certain marginal quartiles and then matches moments,
whereas the LDS algorithm partitions the data using a likelihood-based clustering
and then selects artificial data points so as to mimic the target sampling or posterior
distribution. Both algorithms yield artificial data points with associated weights, so
the use of squashed data requires learning algorithms that can handle these weights.
LDS is slightly more general than DS in that it also supports ANN-based learning.
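Because both methods output weighted pseudo-points, the learner that consumes them must accept instance weights. The following sketch is illustrative only (it uses a plain random subsample as a stand-in for a genuinely squashed set) and shows how such weights can be passed to scikit-learn's LogisticRegression through the sample_weight argument:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_full = rng.normal(size=(50_000, 5))
y_full = (X_full @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) > 0).astype(int)

# Placeholder for the squashed set: in DS or LDS, Z and w would come from
# the squashing step; a random subsample is used here purely to show the
# weighted fit.
idx = rng.choice(len(X_full), size=500, replace=False)
Z, y_z = X_full[idx], y_full[idx]
w = np.full(len(Z), len(X_full) / len(Z))          # weights sum to 50,000

model = LogisticRegression(max_iter=1000)
model.fit(Z, y_z, sample_weight=w)                 # weights enter the log-likelihood
print(model.coef_)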
A subsequent approach, described in [23], presents a form of data squashing based
on empirical likelihood. This method re-weights a random sample of the data so that
certain expected values match those of the population. The benefits of this method are
a reduction of the optimization cost in terms of computational complexity and its
ability to enhance the performance of boosted random trees.
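A hedged sketch of such a re-weighting step follows (the formulation and names below are illustrative assumptions, not the procedure of [23]): empirical-likelihood style weights are obtained by solving for a Lagrange multiplier so that the weighted sample mean matches a known population mean.

import numpy as np
from scipy.optimize import root

def el_weights(x, mu):
    # Weights w_i = 1 / (n * (1 + lam @ (x_i - mu))) chosen so that the
    # weighted mean of the sample equals the population target mu.
    d = x - mu
    n = len(x)

    def estimating_eq(lam):
        return (d / (1.0 + d @ lam)[:, None]).mean(axis=0)

    lam = root(estimating_eq, x0=np.zeros(d.shape[1])).x
    w = 1.0 / (n * (1.0 + d @ lam))
    return w / w.sum()                        # renormalise against numerical drift

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.3, size=(200, 2))   # a mildly biased sample
w = el_weights(sample, mu=np.zeros(2))        # population mean assumed known
print((w[:, None] * sample).sum(axis=0))      # close to [0, 0]

Once computed, the weights can be passed to a weighted learner in the same way as the DS/LDS weights above.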
6.3.3 Data Clustering
Clustering algorithms partition the data examples into groups, or clusters, so that data
samples within a cluster are “similar” to one another and dissimilar to data examples
that belong to other clusters. Similarity is usually defined in terms of how close the
examples are in space, according to a distance function. The quality of a cluster
can be measured as a function of its diameter, which is the maximum
distance between any two samples belonging to the cluster. The average distance
of each object within the cluster to the centroid is an alternative measure of cluster
quality.
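Both measures are straightforward to compute for a given cluster; the helper functions below are an illustrative sketch assuming Euclidean distance:

import numpy as np
from scipy.spatial.distance import pdist

def cluster_diameter(points):
    # Maximum pairwise distance between members of the cluster.
    return pdist(points).max()

def mean_distance_to_centroid(points):
    # Average distance from each member to the cluster centroid.
    centroid = points.mean(axis=0)
    return np.linalg.norm(points - centroid, axis=1).mean()

rng = np.random.default_rng(3)
cluster = rng.normal(size=(100, 2))
print(cluster_diameter(cluster), mean_distance_to_centroid(cluster))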