by assigning a state of a Markov chain to each
data object and the transition probability between
states is defined as an exponential function of the
pair-wise distances. A transition matrix is defined
whose entries specify the transition probabilities
between the different states. Using this transition
matrix, a random walk with an infinite number of
steps would provide no information on the starting
point. For clustering, it is interesting to consider the
information loss on the starting point for a random
walk of some fixed number of steps t. The rate
of information loss is slow if the random walk is
stabilized by structures in the data, for example
if it remains within one cluster. IB is applied to find a partitioning of the data into clusters which best preserves the information about the state reached after t steps of the random walk (the auxiliary variable in this case). However, this approach is not completely
parameter-free since a suitable number of steps t
needs to be selected.
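To make this construction concrete, the following Python sketch (the exponential kernel with bandwidth sigma and the helper names are illustrative assumptions, not the original formulation) builds a row-stochastic transition matrix from pairwise distances and computes the state distributions after t steps of the random walk:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def transition_matrix(X, sigma=1.0):
        # Row-stochastic transition matrix: exponential function of the
        # pairwise Euclidean distances, normalized per row (sigma is an
        # assumed bandwidth parameter).
        D = squareform(pdist(X))
        W = np.exp(-D / sigma)
        return W / W.sum(axis=1, keepdims=True)

    def t_step_distributions(P, t):
        # Distribution over end states after t steps, one row per start state.
        return np.linalg.matrix_power(P, t)

    # As t grows, the rows become increasingly similar: the walk forgets its
    # starting point. IB compresses the starting point while preserving
    # information about the state reached after t steps.
    X = np.random.rand(50, 2)
    Pt = t_step_distributions(transition_matrix(X, sigma=0.2), t=10)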
The second category of approaches focuses directly on parameter-free partitioning clustering and is based on MDL and related ideas such as BIC.
For these methods, the data itself is the only source
of knowledge. Information-theoretic arguments
are applied for model selection during clustering
and, in contrast to the approaches based on IB,
these approaches involve a lossless compression
of the data. The work of Still and Bialek (2004)
provides important theoretical background by us-
ing information-theoretic arguments to relate the
maximal number of clusters that can be detected
by partitioning clustering to the size of the data set. The algorithm XMeans (Pelleg & Moore
2000) combines the K-means paradigm with the
Bayesian Information Criterion for parameter-free
clustering. XMeans involves an efficient top-down
splitting algorithm where intermediate results are
obtained by bisecting K-means and are evaluated
with BIC. However, due to the properties of K-means, only spherical Gaussian clusters can be detected.
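The model-selection idea behind XMeans can be illustrated with the following simplified sketch; it only scores flat K-means solutions with a spherical-Gaussian BIC, using a simplified variant of the criterion rather than the original top-down splitting procedure:

    import numpy as np
    from sklearn.cluster import KMeans

    def spherical_bic(X, km):
        # BIC of a fitted KMeans model under one shared spherical variance
        # (a simplified variant of the criterion used by XMeans).
        n, d = X.shape
        k = km.n_clusters
        var = km.inertia_ / max(n - k, 1)
        sizes = np.bincount(km.labels_, minlength=k)
        log_likelihood = (np.sum(sizes * np.log(np.maximum(sizes, 1) / n))
                          - 0.5 * n * d * np.log(2 * np.pi * var)
                          - 0.5 * (n - k))
        n_params = k * (d + 1)          # centers plus mixing weights (simplified)
        return log_likelihood - 0.5 * n_params * np.log(n)

    X = np.random.rand(300, 2)
    models = [KMeans(n_clusters=k, n_init=10).fit(X) for k in range(1, 8)]
    best = max(models, key=lambda km: spherical_bic(X, km))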
The algorithm G-means (Gaussian means) introduced in (Hamerly & Elkan 2003) has been designed for parameter-free correlation clustering. G-means follows an algorithmic paradigm similar to that of XMeans, with top-down splitting and the application of bisecting K-means upon each
split. However, the criterion to decide whether a
cluster should be split up into two is based on a
statistical test for Gaussianity. Splitting continues until the clusters are Gaussian, which implies, of course, that non-Gaussian clusters cannot be detected.
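One plausible reading of this split criterion is sketched below; the Anderson-Darling test and the 5% significance level are assumptions chosen for illustration rather than details quoted from the original paper:

    import numpy as np
    from scipy.stats import anderson
    from sklearn.cluster import KMeans

    def should_split(cluster_points, critical_index=2):
        # Split the cluster in two and project its points onto the axis
        # connecting the two child centers; keep the split only if an
        # Anderson-Darling test rejects normality of the projection.
        # (critical_index=2 corresponds to the tabulated 5% level; this
        # choice is an assumption.)
        km = KMeans(n_clusters=2, n_init=10).fit(cluster_points)
        v = km.cluster_centers_[1] - km.cluster_centers_[0]
        proj = cluster_points @ (v / np.linalg.norm(v))
        proj = (proj - proj.mean()) / proj.std()
        result = anderson(proj, dist='norm')
        return result.statistic > result.critical_values[critical_index]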
The algorithm PG-means (Projected Gaussian means) (Feng & Hamerly 2006) is similar to G-means but learns models of increasing K with the EM algorithm. In each iteration, various
one-dimensional projections of the data and the
model are tested for Gaussianity. Experiments demonstrate that PG-means is less prone to overfitting than G-means.
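The projection test can be sketched roughly as follows; the single random unit direction and the Kolmogorov-Smirnov comparison against the projected mixture are one plausible instantiation of the idea, not the original implementation:

    import numpy as np
    from scipy.stats import kstest, norm
    from sklearn.mixture import GaussianMixture

    def projection_test(X, gmm, seed=0):
        # Project data and model onto a random unit direction and compare the
        # projected sample against the projected mixture with a KS test;
        # a small p-value indicates the current model is rejected.
        # (Assumes gmm was fitted with covariance_type='full'.)
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(X.shape[1])
        v /= np.linalg.norm(v)
        x_proj = X @ v
        mu = gmm.means_ @ v
        sigma = np.sqrt(np.einsum('i,kij,j->k', v, gmm.covariances_, v))
        mix_cdf = lambda t: np.sum(gmm.weights_ * norm.cdf((t[:, None] - mu) / sigma), axis=1)
        return kstest(x_proj, mix_cdf).pvalue

In PG-means, several such projections would be tested in each iteration; if any projection rejects the current model, K is increased and EM is run again.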
Figueiredo and Jain (2002) propose a parameter-free EM algorithm based on the MDL principle. In contrast to XMeans, which
applies BIC to evaluate intermediate results, an
MDL-based model selection criterion is directly
integrated into EM. Due to the properties of EM, Gaussian data is assumed, but the algorithm can
be supplied with a covariance matrix and thus
supports the same cluster notion as G-means and
PG-means.
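Their criterion is woven into the EM iterations themselves; as a rough stand-in, the following sketch merely scores full-covariance mixtures of increasing K with a BIC-style criterion after EM has converged, illustrating the selection principle but not the component-annihilation mechanism of Figueiredo and Jain:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(500, 2)
    candidates = [GaussianMixture(n_components=k, covariance_type='full',
                                  random_state=0).fit(X)
                  for k in range(1, 10)]
    best = min(candidates, key=lambda g: g.bic(X))   # lower BIC = shorter description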
It turns out that the underlying clustering al-
gorithm and the choice of the similarity measure
already constitute a kind of parameterization which implicitly comes with specific assumptions. The commonly used Euclidean distance, for example, assumes Gaussian data. In addition, the algorithms discussed so far are very sensitive to noise objects and outliers. These problems are
addressed by the recently proposed algorithm
RIC (Robust Information-theoretic Clustering)
(Böhm et al. 2006). This algorithm can be applied to post-process an arbitrary, imperfect initial clustering. The approach is based on MDL and introduces a coding scheme especially suitable for clustering, together with algorithms for purifying the initial clusters from noise. The coding scheme for a cluster is illustrated in Figure
8. Each coordinate of each cluster is associated
with a probability density function (PDF).
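The per-coordinate coding idea can be sketched as follows; the candidate PDF families and the treatment of code lengths are illustrative assumptions rather than the exact coding scheme of Böhm et al.:

    import numpy as np
    from scipy import stats

    def coordinate_code_length(values, dist):
        # Approximate code length (in bits) of one coordinate under a fitted
        # PDF: the sum of -log2 pdf(x) over all values of that coordinate.
        params = dist.fit(values)
        density = np.maximum(dist.pdf(values, *params), 1e-300)
        return -np.sum(np.log2(density))

    def cluster_code_length(cluster_points,
                            families=(stats.norm, stats.uniform, stats.laplace)):
        # Per coordinate, keep the candidate PDF family that yields the
        # shortest code; the cluster cost is the sum over coordinates.
        return sum(min(coordinate_code_length(col, d) for d in families)
                   for col in cluster_points.T)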