by assigning a state of a Markov chain to each
data object and the transition probability between
states is defined as an exponential function of the
pair-wise distances. A transition matrix is defined
whose entries specify the transition probabilities
between the different states. Using this transition
matrix, a random walk with an infinite number of
steps would provide no information on the starting
point. For clustering, it is interesting to consider the
information loss on the starting point for a random
walk of some fixed number of steps t. The rate
of information loss is slow if the random walk is
stabilized by structures in the data, for example
if it remains within one cluster. IB is applied to find a partitioning of the data into clusters which best preserves the information about the state reached after t steps of the random walk (the auxiliary variable in this case). However, this approach is not completely
parameter-free since a suitable number of steps t
needs to be selected.
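To make this construction concrete, the following Python sketch (the exponential kernel with bandwidth sigma and the helper names are illustrative assumptions, not the original formulation) builds a row-stochastic transition matrix from pairwise distances and computes the state distributions after t steps of the random walk:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def transition_matrix(X, sigma=1.0):
        # Row-stochastic transition matrix: exponential function of the
        # pairwise Euclidean distances, normalized per row (sigma is an
        # assumed bandwidth parameter).
        D = squareform(pdist(X))
        W = np.exp(-D / sigma)
        return W / W.sum(axis=1, keepdims=True)

    def t_step_distributions(P, t):
        # Distribution over end states after t steps, one row per start state.
        return np.linalg.matrix_power(P, t)

    # As t grows, the rows become increasingly similar: the walk forgets its
    # starting point. IB compresses the starting point while preserving
    # information about the state reached after t steps.
    X = np.random.rand(50, 2)
    Pt = t_step_distributions(transition_matrix(X, sigma=0.2), t=10)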
The second category of approaches focuses directly on parameter-free partitioning clustering and is based on MDL and related ideas such as BIC.
For these methods, the data itself is the only source
of knowledge. Information-theoretic arguments
are applied for model selection during clustering
and, in contrast to the approaches based on IB,
these approaches involve a lossless compression
of the data. The work of Still and Bialek (2004)
provides important theoretical background by us-
ing information-theoretic arguments to relate the
maximal number of clusters that can be detected
by partitioning clustering to the size of the data set. The algorithm XMeans (Pelleg & Moore
2000) combines the K-means paradigm with the
Bayesian Information Criterion for parameter-free
clustering. XMeans involves an efficient top-down
splitting algorithm where intermediate results are
obtained by bisecting K-means and are evaluated
with BIC. However, due to the properties of K-means, only spherical Gaussian clusters can be detected.
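The model-selection idea behind XMeans can be illustrated with the following simplified sketch; it only scores flat K-means solutions with a spherical-Gaussian BIC, using a simplified variant of the criterion rather than the original top-down splitting procedure:

    import numpy as np
    from sklearn.cluster import KMeans

    def spherical_bic(X, km):
        # BIC of a fitted KMeans model under one shared spherical variance
        # (a simplified variant of the criterion used by XMeans).
        n, d = X.shape
        k = km.n_clusters
        var = km.inertia_ / max(n - k, 1)
        sizes = np.bincount(km.labels_, minlength=k)
        log_likelihood = (np.sum(sizes * np.log(np.maximum(sizes, 1) / n))
                          - 0.5 * n * d * np.log(2 * np.pi * var)
                          - 0.5 * (n - k))
        n_params = k * (d + 1)          # centers plus mixing weights (simplified)
        return log_likelihood - 0.5 * n_params * np.log(n)

    X = np.random.rand(300, 2)
    models = [KMeans(n_clusters=k, n_init=10).fit(X) for k in range(1, 8)]
    best = max(models, key=lambda km: spherical_bic(X, km))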
The algorithm G-means (Gaussian means) introduced in (Hamerly & Elkan 2003) has been designed for parameter-free correlation clustering. G-means follows an algorithmic paradigm similar to that of XMeans, with top-down splitting and the application of bisecting K-means upon each
split. However, the criterion to decide whether a
cluster should be split up into two is based on a
statistical test for Gaussianity. Splitting continues until the clusters are Gaussian, which implies, of course, that non-Gaussian clusters cannot be detected.
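One plausible reading of this split criterion is sketched below; the Anderson-Darling test and the 5% significance level are assumptions chosen for illustration rather than details quoted from the original paper:

    import numpy as np
    from scipy.stats import anderson
    from sklearn.cluster import KMeans

    def should_split(cluster_points, critical_index=2):
        # Split the cluster in two and project its points onto the axis
        # connecting the two child centers; keep the split only if an
        # Anderson-Darling test rejects normality of the projection.
        # (critical_index=2 corresponds to the tabulated 5% level; this
        # choice is an assumption.)
        km = KMeans(n_clusters=2, n_init=10).fit(cluster_points)
        v = km.cluster_centers_[1] - km.cluster_centers_[0]
        proj = cluster_points @ (v / np.linalg.norm(v))
        proj = (proj - proj.mean()) / proj.std()
        result = anderson(proj, dist='norm')
        return result.statistic > result.critical_values[critical_index]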
The algorithm PG-means (Projected Gaussian means) (Feng & Hamerly 2006) is similar to G-means but learns models of increasing K with the EM algorithm. In each iteration, various
one-dimensional projections of the data and the
model are tested for Gaussianity. Experiments demonstrate that PG-means is less prone to overfitting than G-means.
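The projection test can be sketched roughly as follows; the single random unit direction and the Kolmogorov-Smirnov comparison against the projected mixture are one plausible instantiation of the idea, not the original implementation:

    import numpy as np
    from scipy.stats import kstest, norm
    from sklearn.mixture import GaussianMixture

    def projection_test(X, gmm, seed=0):
        # Project data and model onto a random unit direction and compare the
        # projected sample against the projected mixture with a KS test;
        # a small p-value indicates the current model is rejected.
        # (Assumes gmm was fitted with covariance_type='full'.)
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(X.shape[1])
        v /= np.linalg.norm(v)
        x_proj = X @ v
        mu = gmm.means_ @ v
        sigma = np.sqrt(np.einsum('i,kij,j->k', v, gmm.covariances_, v))
        mix_cdf = lambda t: np.sum(gmm.weights_ * norm.cdf((t[:, None] - mu) / sigma), axis=1)
        return kstest(x_proj, mix_cdf).pvalue

In PG-means, several such projections would be tested in each iteration; if any projection rejects the current model, K is increased and EM is run again.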
Figueiredo and Jain (2002) propose a parameter-free EM algorithm based on the MDL principle. In contrast to XMeans, which
applies BIC to evaluate intermediate results, an
MDL-based model selection criterion is directly
integrated into EM. Due to the properties of EM, Gaussian data is assumed, but the algorithm can
be supplied with a covariance matrix and thus
supports the same cluster notion as G-means and
PG-means.
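Their criterion is woven into the EM iterations themselves; as a rough stand-in, the following sketch merely scores full-covariance mixtures of increasing K with a BIC-style criterion after EM has converged, illustrating the selection principle but not the component-annihilation mechanism of Figueiredo and Jain:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(500, 2)
    candidates = [GaussianMixture(n_components=k, covariance_type='full',
                                  random_state=0).fit(X)
                  for k in range(1, 10)]
    best = min(candidates, key=lambda g: g.bic(X))   # lower BIC = shorter description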
It turns out that the underlying clustering al-
gorithm and the choice of the similarity measure
already constitute a kind of parameterization which implicitly comes with specific assumptions. The commonly used Euclidean distance, for example, assumes Gaussian data. In addition, the algorithms discussed so far are very sensitive to noise objects and outliers. These problems are
addressed by the recently proposed algorithm
RIC (Robust Information-theoretic Clustering)
(Böhm et al. 2006). This algorithm can be applied to post-process an arbitrary, imperfect initial clustering. The approach is based on MDL and introduces a coding scheme especially suitable for clustering, together with algorithms for purifying the initial clusters from noise. The coding scheme for a cluster is illustrated in Figure
8. Each coordinate of each cluster is associated
with a probability density function (PDF).
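The per-coordinate coding idea can be sketched as follows; the candidate PDF families and the treatment of code lengths are illustrative assumptions rather than the exact coding scheme of Böhm et al.:

    import numpy as np
    from scipy import stats

    def coordinate_code_length(values, dist):
        # Approximate code length (in bits) of one coordinate under a fitted
        # PDF: the sum of -log2 pdf(x) over all values of that coordinate.
        params = dist.fit(values)
        density = np.maximum(dist.pdf(values, *params), 1e-300)
        return -np.sum(np.log2(density))

    def cluster_code_length(cluster_points,
                            families=(stats.norm, stats.uniform, stats.laplace)):
        # Per coordinate, keep the candidate PDF family that yields the
        # shortest code; the cluster cost is the sum over coordinates.
        return sum(min(coordinate_code_length(col, d) for d in families)
                   for col in cluster_points.T)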