or neighbourhood sizes in density-based clustering, or the subspace dimensionality in projected clustering. In practice, the usual way to cope with this problem is to run the algorithm several times with different parameter settings, so that suitable values can be learned by trial and error. This process, however, is very time-consuming: for a large data set only a few trials are feasible, or suitable parameter settings must be estimated from a small sample, and even then there is no guarantee that useful values are obtained. The large number of parameters required by many algorithms also conflicts with the unsupervised nature of the clustering problem as introduced at the beginning: the goal of clustering is to find a natural grouping of the data without requiring any background knowledge, yet without background knowledge it is often very difficult to specify appropriate parameter settings.
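As a concrete illustration of this trial-and-error procedure, the following sketch runs a density-based algorithm (DBSCAN) on a small sample with a grid of candidate parameter values and scores each result with the silhouette coefficient. The toy data, the grid values and the use of the silhouette score as a quality proxy are assumptions made purely for illustration and are not part of the approaches discussed here.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # toy data and a small random sample standing in for a large data set
    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    sample = X[np.random.default_rng(0).choice(len(X), size=200, replace=False)]

    best = None
    for eps in (0.3, 0.5, 0.8, 1.2):            # candidate neighbourhood radii
        for min_samples in (3, 5, 10):          # candidate density thresholds
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(sample)
            if len(set(labels) - {-1}) < 2:     # too few clusters to score
                continue
            score = silhouette_score(sample, labels)
            if best is None or score > best[0]:
                best = (score, eps, min_samples)

    print("best silhouette %.3f at eps=%.1f, min_samples=%d" % best)

Even this small sweep already requires twelve clustering runs, which illustrates why the procedure quickly becomes infeasible for large data sets or larger parameter grids.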
Recently, parameter-free clustering has therefore attracted increasing attention. Most approaches, such as the algorithms X-Means (Pelleg & Moore 2001), G-Means (Hamerly & Elkan 2003) and OCI (Böhm et al. 2008), are founded on information theory and closely related concepts. The basic idea is to relate clustering to data compression. Assume that data consisting of feature vectors should be transferred via a communication channel from a sender to a receiver. Without clustering, each coordinate needs to be fully coded by transforming the numerical value into a bit string. If the data exhibit regularities, clustering can drastically reduce the communication cost. For example, the EM algorithm can be applied to determine a model for the data. With this model, the data can be compressed very effectively, since only the deviations from the model need to be encoded, which requires far fewer bits than the full coordinates. In addition, the model itself needs to be encoded and transferred; it can be regarded as a codebook that allows the receiver to decompress the data again. This basic idea, often referred to as the Minimum Description Length (MDL) principle (Grünwald 2005), allows comparing different clusterings: given two clusterings A and B of the same data set, A is better than B if it allows compressing the data set more effectively than B. Note that the overall communication cost comprising data and model is considered here, not only the code length spent on the data. Thereby a natural balance between the complexity of the model and its fit to the data is achieved. Closely related ideas developed by different communities include the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC) and the Information Bottleneck (IB) method (Tishby et al. 2000).
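A simplified sketch of such a two-part comparison is given below: each clustering is treated as a spherical-Gaussian codebook, the data cost is the negative log-likelihood in bits plus the cost of encoding the cluster assignments, and the model cost is a BIC-style penalty of half a log2(n) per free parameter. The function name, the coding scheme and the toy data are assumptions made for illustration; the cited approaches use more refined codes.

    import numpy as np
    from scipy.stats import norm
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def description_length_bits(X, labels):
        """Total code length in bits: data encoded relative to a
        spherical-Gaussian model per cluster, plus a BIC-style model cost."""
        n, d = X.shape
        data_bits, n_params = 0.0, 0
        for c in np.unique(labels):
            Xc = X[labels == c]
            mu, sigma = Xc.mean(axis=0), Xc.std() + 1e-9
            # bits for the deviations of the points from the cluster model
            data_bits -= norm.logpdf(Xc, loc=mu, scale=sigma).sum() / np.log(2)
            # bits for assigning each point of this cluster to its cluster
            data_bits += len(Xc) * -np.log2(len(Xc) / n)
            n_params += d + 1                       # mean vector + one variance
        model_bits = 0.5 * n_params * np.log2(n)    # cost of the codebook itself
        return data_bits + model_bits

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    A = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    B = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(X)
    # A is preferable to B if it yields the smaller total code length
    print(description_length_bits(X, A), description_length_bits(X, B))

The model cost term penalizes codebooks with many clusters, so a clustering only wins the comparison if its additional model complexity is paid for by a correspondingly larger saving on the data cost.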
A first line of papers is based on the Information Bottleneck method. The fundamental idea behind IB is to compress only the relevant characteristics of the data, which leads to a lossy compression. To judge relevance, a second source of information besides the data is required, often called the auxiliary variable. These approaches are therefore also related to semi-supervised clustering, but are discussed here because of their information-theoretic foundation. Relevance is defined as the amount of information that the data provide about the auxiliary variable, and the clustering problem is cast as finding a lossy compression of the data that preserves as much information about the auxiliary variable as possible. IB is particularly useful for co-occurrence data such as words and documents; in this context it is interesting to discover clusters of words which carry relevant information about the documents (Slonim & Tishby 2000, Dhillon et al. 2003).
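The relevance term can be made concrete with a small sketch: given a word-document co-occurrence count matrix and a hard clustering of the words, the mutual information I(T; Y) between word clusters T and documents Y measures how much information about the documents the compressed representation preserves. The count matrix, the two example clusterings and the function name below are illustrative assumptions, not taken from the cited papers.

    import numpy as np

    def cluster_document_information(counts, word_clusters):
        """Mutual information I(T; Y) between word clusters T and documents Y,
        estimated from a word-document co-occurrence count matrix."""
        p_td = np.zeros((word_clusters.max() + 1, counts.shape[1]))
        for w, t in enumerate(word_clusters):
            p_td[t] += counts[w]
        p_td /= p_td.sum()                        # joint distribution p(t, d)
        p_t = p_td.sum(axis=1, keepdims=True)     # marginal p(t)
        p_d = p_td.sum(axis=0, keepdims=True)     # marginal p(d)
        mask = p_td > 0
        return (p_td[mask] * np.log2(p_td[mask] / (p_t @ p_d)[mask])).sum()

    counts = np.array([[8., 1., 0.],   # word 0 occurs mostly in document 0
                       [7., 2., 1.],   # word 1 occurs mostly in document 0
                       [0., 5., 6.],   # word 2 occurs in documents 1 and 2
                       [1., 4., 7.]])  # word 3 occurs in documents 1 and 2
    good = np.array([0, 0, 1, 1])      # groups words with similar document profiles
    bad  = np.array([0, 1, 0, 1])      # mixes dissimilar words together
    print(cluster_document_information(counts, good))   # preserves more information
    print(cluster_document_information(counts, bad))    # preserves less information

A word clustering that keeps words with similar document profiles together preserves more information about the documents than one that mixes them, which is exactly the quantity IB trades off against the degree of compression.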
This principle has been extended by Slonim et al. (2001) to the multivariate information bottleneck technique, which allows extracting several meaningful partitions of the data simultaneously. As demonstrated in (Tishby & Slonim 2000), the IB principle can also be applied to cluster general metric data. This algorithm first transforms the similarity matrix of the data into a Markov process