model. Linear and non-linear correlation clusters
frequently occur in high dimensional vector data,
such as metabolic data. Typically, such clusters
exist in a subset of the dimensions only. Figure
1(d) displays a non-Gaussian correlation cluster
with non-orthogonal major directions. Besides the
cluster, there are some outliers. Noise points and
outliers are common in real-world data.
This chapter provides a survey of novel
trends in clustering. We focus especially on
highlighting the conceptual similarities and
differences among approaches, and on their
applicability to real-world problems. We thus
hope to provide a conceptual survey that is
valuable, and maybe even inspiring, for different
groups of readers: scientists and students, but
also practitioners looking for solutions in a
concrete application. We can only provide a very
incomplete snapshot focusing on some vital current
and emerging trends, and we apologize in advance
for all the important approaches that are missing.
We will discuss novel algorithms which are
suitable to detect clusters in high dimensional
feature space, including linear and non-linear
correlation clusters with orthogonal and non-
orthogonal major directions in noisy real-world
data, as depicted in Figure 1(b)-(d). We will
focus not only on vector data but also introduce
solutions for clustering other types of data, for
example graphs or data streams. In an internet
scenario, for instance, it is interesting to
cluster users according to their behavior. The
file path records a user's route through a
website and can be represented as a graph. For
online shop design, it would be interesting to
cluster users based on their file paths. The
result could help to
improve the structure of a website for specific
groups of customers.
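As a rough illustration of the idea, the sketch below
represents each user's file path as a set of directed
edges (page to next page) and compares two users by the
overlap of their edge sets. The session data, the
function names, and the choice of Jaccard similarity are
assumptions made for this example only; the chapter does
not prescribe a particular graph similarity.

```python
# Hypothetical click paths: one ordered list of requested pages per user.
sessions = {
    "alice": ["/home", "/shoes", "/shoes/running", "/cart"],
    "bob": ["/home", "/shoes", "/shoes/running", "/checkout"],
    "carol": ["/home", "/books", "/books/fiction", "/cart"],
}


def path_to_edges(path):
    """Turn a page sequence into a set of directed edges (page, next page)."""
    return set(zip(path, path[1:]))


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


# One small navigation graph (edge set) per user.
graphs = {user: path_to_edges(path) for user, path in sessions.items()}


def similarity(u, v):
    """Similarity of two users, based on their shared navigation edges."""
    return jaccard(graphs[u], graphs[v])


if __name__ == "__main__":
    for u, v in [("alice", "bob"), ("alice", "carol")]:
        print(u, v, similarity(u, v))
```

A pairwise similarity of this kind could then be passed
to any clustering method that works on similarities or
distances rather than on feature vectors.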
Most approaches to clustering require defin-
ing a suitable representation of the data objects,
for example as feature vectors or graphs together
with a notion of object similarity. In most appli-
cations, this is challenging and a mathematical
similarity measure which fully agrees with the
needs of the application may not even exist. In
information retrieval, this problem is commonly
referred to as the semantic gap. External side
information which is often available in the form
of expert knowledge can be very helpful to cope
with this problem. For example, it is known from
the literature that some metabolites are similar
because they fulfill a common function in the
organism. Semi-supervised clustering, an emerging
research area that has recently attracted much
attention, focuses on integrating such side
information into clustering. We will discuss some
interesting solutions. Besides a suitable
representation of the data
objects and a notion of similarity, most clustering
algorithms require input parameters which are
often difficult to estimate, for example the number
of desired clusters. We will introduce some recent
approaches to parameter-free clustering which
avoid crucial parameters by the application of
information-theoretic concepts.
Background
To make this chapter self-contained, and to il-
lustrate some of the challenges associated with
clustering, we will briefly discuss two fundamental
clustering paradigms: iterative partitioning clus-
tering and hierarchical density-based clustering.
These two paradigms introduce two very differ-
ent cluster notions which have been taken up and
further elaborated by many other approaches.
For illustration and comparison, we introduce
iterative partitioning clustering using the K-means
algorithm (Duda & Hart 1973). K-means requires
a metric distance function in vector space. In
addition, the user has to specify the number of
desired clusters K as an input parameter. Usually
K-means starts with an arbitrary partitioning of
the objects into K clusters. After this initialization,
the algorithm iteratively performs the following
two steps until convergence: (1) Update centers:
For each cluster, compute the mean vector of its