or neighbourhood sizes in density-based clustering, or the subspace dimensionality in projected clustering. In practice, the usual way to cope with this problem is to run the algorithm several times with different parameter settings, so that suitable values can be learned by trial and error. This process, however, is very time-consuming: for a large data set only a few trials are feasible, or suitable parameter settings must be estimated from a small sample, and even then there is no guarantee that useful values are obtained. The large number of parameters required by many algorithms also conflicts with the unsupervised nature of the clustering problem as introduced at the beginning: the goal of clustering is to find a natural grouping of the data without requiring any background knowledge, yet without background knowledge it is often very difficult to specify appropriate parameter settings.
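As a concrete illustration of this trial-and-error procedure, the following sketch runs a density-based algorithm (DBSCAN) on a small sample with a grid of candidate parameter values and scores each result with the silhouette coefficient. The toy data, the grid values and the use of the silhouette score as a quality proxy are assumptions made purely for illustration and are not part of the approaches discussed here.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # toy data and a small random sample standing in for a large data set
    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    sample = X[np.random.default_rng(0).choice(len(X), size=200, replace=False)]

    best = None
    for eps in (0.3, 0.5, 0.8, 1.2):            # candidate neighbourhood radii
        for min_samples in (3, 5, 10):          # candidate density thresholds
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(sample)
            if len(set(labels) - {-1}) < 2:     # too few clusters to score
                continue
            score = silhouette_score(sample, labels)
            if best is None or score > best[0]:
                best = (score, eps, min_samples)

    print("best silhouette %.3f at eps=%.1f, min_samples=%d" % best)

Even this small sweep already requires twelve clustering runs, which illustrates why the procedure quickly becomes infeasible for large data sets or larger parameter grids.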
Recently, parameter-free clustering has therefore attracted increasing attention. Most approaches, such as the algorithms X-Means (Pelleg & Moore 2001), G-Means (Hamerly & Elkan 2003) and OCI (Böhm et al. 2008), are founded on information theory and closely related concepts. The basic idea is to relate clustering to data compression. Assume that data consisting of feature vectors should be transferred via a communication channel from a sender to a receiver. Without clustering, each coordinate needs to be fully coded by transforming the numerical value into a bit string. If the data exhibit regularities, clustering can drastically reduce the communication cost. For example, the EM algorithm can be applied to determine a model for the data. With this model, the data can be compressed very effectively, since only the deviations from the model need to be encoded, which requires far fewer bits than the full coordinates. In addition, the model itself needs to be encoded and transferred; it can be regarded as a codebook that allows the receiver to decompress the data again. This basic idea, often referred to as the Minimum Description Length (MDL) principle (Grünwald 2005), allows comparing different clusterings: given two clusterings A and B of the same data set, A is better than B if it allows compressing the data set more effectively than B. Note that the overall communication cost comprising data and model is considered here, not only the code length spent on the data. Thereby a natural balance between the complexity of the model and its fit to the data is achieved. Closely related ideas developed by different communities include the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC) and the Information Bottleneck (IB) method (Tishby et al. 2000).
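A simplified sketch of such a two-part comparison is given below: each clustering is treated as a spherical-Gaussian codebook, the data cost is the negative log-likelihood in bits plus the cost of encoding the cluster assignments, and the model cost is a BIC-style penalty of half a log2(n) per free parameter. The function name, the coding scheme and the toy data are assumptions made for illustration; the cited approaches use more refined codes.

    import numpy as np
    from scipy.stats import norm
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def description_length_bits(X, labels):
        """Total code length in bits: data encoded relative to a
        spherical-Gaussian model per cluster, plus a BIC-style model cost."""
        n, d = X.shape
        data_bits, n_params = 0.0, 0
        for c in np.unique(labels):
            Xc = X[labels == c]
            mu, sigma = Xc.mean(axis=0), Xc.std() + 1e-9
            # bits for the deviations of the points from the cluster model
            data_bits -= norm.logpdf(Xc, loc=mu, scale=sigma).sum() / np.log(2)
            # bits for assigning each point of this cluster to its cluster
            data_bits += len(Xc) * -np.log2(len(Xc) / n)
            n_params += d + 1                       # mean vector + one variance
        model_bits = 0.5 * n_params * np.log2(n)    # cost of the codebook itself
        return data_bits + model_bits

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    A = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    B = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(X)
    # A is preferable to B if it yields the smaller total code length
    print(description_length_bits(X, A), description_length_bits(X, B))

The model cost term penalizes codebooks with many clusters, so a clustering only wins the comparison if its additional model complexity is paid for by a correspondingly larger saving on the data cost.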
A first line of papers is based on the Information Bottleneck method. The fundamental idea behind IB is to compress only the relevant characteristics of the data, which leads to a lossy compression. To judge relevance, a second source of information besides the data is required, often called the auxiliary variable. These approaches are therefore also related to semi-supervised clustering, but are discussed here because of their information-theoretic foundation. Relevance is defined as the amount of information that the data provide about the auxiliary variable, and the clustering problem is cast as finding a lossy compression of the data that preserves as much information about the auxiliary variable as possible. IB is particularly useful for co-occurrence data such as words and documents; in this context it is interesting to discover clusters of words which carry relevant information about the documents (Slonim & Tishby 2000, Dhillon et al. 2003).
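The relevance term can be made concrete with a small sketch: given a word-document co-occurrence count matrix and a hard clustering of the words, the mutual information I(T; Y) between word clusters T and documents Y measures how much information about the documents the compressed representation preserves. The count matrix, the two example clusterings and the function name below are illustrative assumptions, not taken from the cited papers.

    import numpy as np

    def cluster_document_information(counts, word_clusters):
        """Mutual information I(T; Y) between word clusters T and documents Y,
        estimated from a word-document co-occurrence count matrix."""
        p_td = np.zeros((word_clusters.max() + 1, counts.shape[1]))
        for w, t in enumerate(word_clusters):
            p_td[t] += counts[w]
        p_td /= p_td.sum()                        # joint distribution p(t, d)
        p_t = p_td.sum(axis=1, keepdims=True)     # marginal p(t)
        p_d = p_td.sum(axis=0, keepdims=True)     # marginal p(d)
        mask = p_td > 0
        return (p_td[mask] * np.log2(p_td[mask] / (p_t @ p_d)[mask])).sum()

    counts = np.array([[8., 1., 0.],   # word 0 occurs mostly in document 0
                       [7., 2., 1.],   # word 1 occurs mostly in document 0
                       [0., 5., 6.],   # word 2 occurs in documents 1 and 2
                       [1., 4., 7.]])  # word 3 occurs in documents 1 and 2
    good = np.array([0, 0, 1, 1])      # groups words with similar document profiles
    bad  = np.array([0, 1, 0, 1])      # mixes dissimilar words together
    print(cluster_document_information(counts, good))   # preserves more information
    print(cluster_document_information(counts, bad))    # preserves less information

A word clustering that keeps words with similar document profiles together preserves more information about the documents than one that mixes them, which is exactly the quantity IB trades off against the degree of compression.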
This principle has been extended by Slonim et al. (2001) to the multivariate information bottleneck technique, which allows extracting several meaningful partitions of the data simultaneously. As demonstrated in (Tishby & Slonim 2000), the IB principle can also be applied to cluster general metric data. This algorithm first transforms the similarity matrix of the data into a Markov process