molecular structures. However, relatively few papers, such as (Tsuda & Kudo 2006), focus on clustering this type of data.
Not only is the amount of data collected in modern applications increasing rapidly, but the structure of the data is also becoming richer, more diverse, and more complex. Therefore, integrative clustering of information from different sources will continue to attract much attention.
Related to semi-supervised clustering is the task of clustering multi-represented objects. As
discussed, algorithms for semi-supervised clus-
tering typically consider only relatively simple
types of side information such as constraints or
labels. The goal of multi-represented clustering is
integrative clustering of several equally complex
sources. Initial approaches have been proposed for
different underlying clustering paradigms, for
example spectral clustering (De Sa 2005), the EM
algorithm (Bickel & Scheffer 2004) and density-
based clustering (Achtert et al. 2006).
The presentation of highly specialized methods for the needs emerging from the application side may give the impression that the research community working on clustering (which is in any case split into several sub-communities originating from data mining, databases, machine learning, statistics and physics) is continuously diversifying. However, there is also considerable effort towards integration. Theoretical work on similarities or even
equivalence of at first glance completely different
clustering paradigms not only leads to interesting
insights but can also result in substantial gains
in effectiveness and efficiency. As mentioned,
Dhillon et al. (2004) demonstrate the equivalence
of the normalized cut objective function in spectral
clustering with weighted kernel K-means. This
allows more efficient spectral clustering without
matrix decomposition. Song et al. (2007) provide
a unified view of many clustering algorithms
including K-means, spectral and hierarchical
clustering, by regarding the clustering problem as
maximization of dependence between the data
objects and their cluster labels. They elaborate a formulation of this idea using the Hilbert-Schmidt Independence Criterion and kernel methods, and additionally provide guidelines for practical application. The trend towards a unified view is
not restricted to clustering paradigms, but extends to integrating clustering with closely related methods from mathematics and statistics, especially
techniques for matrix factorization and dimensionality reduction. Ding and He (2004) explore the relationship between K-means clustering and Principal Component Analysis (PCA): the principal components are in fact the continuous solutions to the discrete cluster membership indicators of K-means. This result makes it possible to derive lower bounds on the optimality of K-means. In addition,
K-means can significantly profit from PCA: PCA
provides a good initialization for K-means, and
there is a theoretical justification for applying PCA as a dimensionality-reduction preprocessing step before K-means (at least for data of moderate to medium
dimensionality). These examples demonstrate
that the integrative view of different clustering
paradigms and related techniques not only has
a theoretical value but also has an impact on the
application of clustering algorithms in practice.
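To make the dependence-maximization view of Song et al. (2007) concrete, the following sketch computes the biased empirical HSIC estimate, tr(KHLH)/(n-1)^2, between a linear kernel on the data objects and a delta kernel on candidate cluster labels. The toy data and helper names are illustrative assumptions, not taken from the cited paper:

```python
import numpy as np

def hsic(K, L):
    """Biased empirical Hilbert-Schmidt Independence Criterion
    between two n x n kernel matrices K and L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def label_kernel(y):
    """Delta kernel: 1 if two objects share a cluster label, else 0."""
    return (y[:, None] == y[None, :]).astype(float)

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
true_labels = np.array([0] * 20 + [1] * 20)

K = X @ X.T                                        # linear kernel on the objects
good = hsic(K, label_kernel(true_labels))          # labels match the structure
bad = hsic(K, label_kernel(rng.permutation(true_labels)))  # random relabeling
```

A labeling that matches the data structure yields a larger HSIC value than a random relabeling; this dependence is exactly the quantity that clustering, in this unified view, seeks to maximize.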
We believe that this research direction has great
potential in the future.
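The interplay between PCA and K-means described above can be sketched in a few lines. The two helper functions below are a minimal, illustrative numpy-only implementation (not Ding and He's own), using SVD-based projection followed by plain Lloyd iterations:

```python
import numpy as np

def pca_reduce(X, d):
    """Project X onto its top-d principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd iterations; returns the cluster label of each row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):   # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated 10-dimensional Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4, 1, (30, 10)), rng.normal(4, 1, (30, 10))])

Z = pca_reduce(X, 2)       # dimensionality reduction first ...
labels = kmeans(Z, 2)      # ... then K-means in the reduced space
```

For well-separated data such as this, clustering in the PCA-reduced space recovers the same grouping as clustering in the full space, at a lower cost per distance computation.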
The trend towards unification is not confined to clustering itself. Clustering also integrates fruitfully with other related research areas. Within
data mining and machine learning there are close
relationships to the areas of classification and
outlier detection. The evolving research area of
semi-supervised learning is crossing the borders
between traditional unsupervised clustering with-
out external knowledge and classification, which
is the classical task within supervised learning.
The goal of outlier detection is to find the ex-
ceptional objects of a data set. To specify what
exceptional or outstanding means in the context
of the given data set, it is necessary to have an
idea about what is normal or common. Therefore,
outlier detection is closely related to clustering
and we expect further interactions between these