CONCLUSION

At first glance, the problem specification of clustering as given in the introduction seems very simple: find a natural partitioning of the data into groups or clusters such that the objects assigned to a common cluster are as similar as possible and objects assigned to different clusters differ as much as possible. This very general problem specification is highly relevant in a large variety of applications, wherever an overview of huge amounts of data is desired. With ongoing technological progress, ever larger amounts of data can be acquired and stored at decreasing cost, so the practical relevance of clustering is constantly increasing. We have seen, however, that clustering is by no means a trivial task. Finding a natural grouping of a small set of objects may be easy for humans because of our advanced cognitive abilities, most importantly our ability to focus on relevant information and our ability to intuitively select a suitable level of abstraction. The problem size in real applications, however, exceeds our processing capability by orders of magnitude. We therefore need efficient and effective algorithms for automatically clustering large, complex data sets. Recent developments in clustering address exactly the following questions:

1. How can we automatically find out which part of the information potentially contained in the data is actually relevant for clustering?
2. How can we exploit the cognitive abilities of humans or other types of expert knowledge to improve the clustering result?
3. How can we automatically select a suitable level of abstraction in clustering?
Automatically selecting the information in the data that is relevant for clustering is very challenging. If the data is represented in a high-dimensional vector space, approaches to subspace and projected clustering provide solutions to this problem. Subspace clustering aims at automatically detecting interesting dimensions for clustering and preserves the information that objects can be clustered differently in different subspaces. Projected clustering detects clusters that are associated with a specific subspace, where each object is exclusively assigned to one cluster. Clusters in real-world data are not restricted to axis-parallel subspaces but can be associated with arbitrary linear or non-linear hyperplanes and subspaces. Correlation clustering focuses on detecting such clusters, which are characterized by specific patterns of linear or non-linear feature dependencies.
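To make the intuition concrete, the following toy sketch (the data layout and dimension choices are hypothetical; it uses NumPy and scikit-learn's k-means as building blocks) constructs objects that group one way in dimensions 0-1 and a different way in dimensions 2-3, exactly the situation subspace clustering is designed to expose:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n = 200

    # Hypothetical 4-D data: dimensions 0-1 and dimensions 2-3 each
    # carry their own, independent two-cluster structure.
    labels_a = rng.integers(0, 2, n)        # grouping visible in dims 0-1
    labels_b = rng.integers(0, 2, n)        # different grouping in dims 2-3
    X = rng.normal(scale=0.3, size=(n, 4))
    X[:, 0:2] += labels_a[:, None] * 3.0    # separation only in subspace {0, 1}
    X[:, 2:4] += labels_b[:, None] * 3.0    # separation only in subspace {2, 3}

    for dims, truth in [([0, 1], labels_a), ([2, 3], labels_b)]:
        pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, dims])
        # Agreement with the hidden grouping, up to label permutation.
        agreement = max((pred == truth).mean(), (pred != truth).mean())
        print(f"subspace {dims}: agreement with hidden grouping = {agreement:.2f}")

Clustering the same objects in the two subspaces recovers two different, equally valid groupings, the information a full-space method would blur together.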
In particular, the results of subspace, projected, and correlation clustering provide interesting insights into why objects are clustered together, which is very important for interpretation. For example, correlation clustering of metabolic data can reveal that a specific pattern of linear dependency among metabolites is characteristic of a certain disorder.
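As an illustrative sketch of what such a linear dependency pattern looks like (the synthetic data and the PCA-based check are hypothetical stand-ins, not a specific published correlation clustering algorithm), a group whose points concentrate around a line shows one strongly dominant principal component, while a spherical group does not:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)

    # Hypothetical correlation cluster: points near the line y = 2x, z = -x.
    t = rng.normal(size=(150, 1))
    line_cluster = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(150, 3))

    # A spherical cluster with no linear feature dependency, for contrast.
    blob = rng.normal(size=(150, 3))

    for name, pts in [("correlation cluster", line_cluster), ("spherical blob", blob)]:
        var = PCA().fit(pts).explained_variance_ratio_
        # One dominant component signals a linear dependency pattern.
        print(f"{name}: explained variance ratios = {np.round(var, 3)}")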
For general metric data represented by a similarity matrix, spectral clustering algorithms are very suitable. Selecting relevant information for clustering in this context means learning a suitable similarity measure. Recent approaches propose techniques for automatically adjusting the similarity measure by metric learning in order to improve the cluster structure.
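A minimal sketch of this setting, assuming only pairwise similarities are available (the RBF similarity below is a hypothetical stand-in for a learned measure), using scikit-learn's SpectralClustering with a precomputed affinity matrix:

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons
    from sklearn.metrics import pairwise_distances

    # Hypothetical metric data; in practice only the similarities are given.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # Turn pairwise distances into similarities; an RBF kernel stands in
    # for whatever similarity measure metric learning would produce.
    D = pairwise_distances(X)
    S = np.exp(-(D ** 2) / (2 * 0.1 ** 2))

    labels = SpectralClustering(
        n_clusters=2, affinity="precomputed", random_state=0
    ).fit_predict(S)
    print(np.bincount(labels))  # cluster sizes found from similarities alone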
Semi-supervised approaches to clustering address the second question. These approaches demonstrate that the clustering result can be substantially improved by external side information. This side information is usually obtained from human experts or from other sources of knowledge, such as literature databases. Most algorithms require side information for only very few data objects to achieve a substantial improvement of the clustering result.
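For illustration, such side information is often given as must-link and cannot-link constraints on pairs of objects. The sketch below is a simplified, hypothetical variant in the spirit of constrained k-means, not a specific published implementation: each object is assigned to the nearest center whose cluster does not violate any of its constraints.

    import numpy as np

    def constrained_kmeans(X, k, must_link, cannot_link, n_iter=20, seed=0):
        # Simplified sketch: try clusters in order of distance and skip any
        # cluster that would violate a constraint; objects with no feasible
        # cluster remain unassigned (label -1).
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        labels = np.full(len(X), -1)
        for _ in range(n_iter):
            labels[:] = -1
            for i in range(len(X)):
                dists = ((centers - X[i]) ** 2).sum(axis=1)
                for c in np.argsort(dists):
                    # cannot-link partner already in cluster c -> forbidden
                    bad_cl = any(labels[j] == c
                                 for a, b in cannot_link
                                 for j in ((b,) if a == i else (a,) if b == i else ()))
                    # must-link partner already placed elsewhere -> forbidden
                    bad_ml = any(labels[j] not in (-1, c)
                                 for a, b in must_link
                                 for j in ((b,) if a == i else (a,) if b == i else ()))
                    if not bad_cl and not bad_ml:
                        labels[i] = c
                        break
            # Recompute centers of non-empty clusters.
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        return labels

    # Hypothetical usage: two overlapping blobs plus a few expert-given pairs.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
    labels = constrained_kmeans(X, k=2,
                                must_link=[(0, 1), (60, 61)],
                                cannot_link=[(0, 60)])
    print(np.bincount(labels[labels >= 0]))

Even this handful of constraints biases the assignment step toward the intended grouping, which is why so little side information can go a long way.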
Going beyond data mining, we expect that there will be even more interaction of clustering with other research areas, for example information retrieval, indexing, and parallel and distributed computing.