CONCLUSION

At first glance, the problem specification of clustering as given in the introduction seems very simple: find a natural partitioning of the data into groups or clusters such that the objects assigned to a common cluster are as similar as possible and objects assigned to different clusters differ as much as possible. This very general problem specification is highly relevant in a large variety of applications, wherever an overview of huge amounts of data is desired. With ongoing technological progress, ever larger amounts of data can be acquired and stored at decreasing cost, so the practical relevance of clustering is constantly increasing. We have seen, however, that clustering is by no means a trivial task. Finding a natural grouping of a small set of objects may be easy for humans because of our advanced cognitive abilities, most importantly our ability to focus on relevant information and our ability to intuitively select a suitable level of abstraction. The problem size in real applications, however, exceeds our processing capability by orders of magnitude. We therefore need efficient and effective algorithms for automatically clustering large, complex data sets. Recent developments in clustering address exactly the following questions:

1. How can we automatically find out which part of the information potentially contained in the data is actually relevant for clustering?
2. How can we exploit the cognitive abilities of humans or other types of expert knowledge to improve the clustering result?
3. How can we automatically select a suitable level of abstraction in clustering?
Automatically selecting the information in the data that is relevant for clustering is very challenging. If the data is represented in a high-dimensional vector space, approaches to subspace and projected clustering provide solutions to this problem. Subspace clustering aims at automatically detecting interesting dimensions for clustering and preserves the information that objects can be clustered differently in different subspaces. Projected clustering detects clusters that are associated with a specific subspace, where each object is exclusively assigned to one cluster. Clusters in real-world data are not restricted to axis-parallel subspaces but can be associated with arbitrary linear or non-linear hyperplanes and subspaces. Correlation clustering focuses on detecting such clusters, which are characterized by specific patterns of linear or non-linear feature dependencies.
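To make the intuition concrete, the following toy sketch (the data layout and dimension choices are hypothetical; it uses NumPy and scikit-learn's k-means as building blocks) constructs objects that group one way in dimensions 0-1 and a different way in dimensions 2-3, exactly the situation subspace clustering is designed to expose:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n = 200

    # Hypothetical 4-D data: dimensions 0-1 and dimensions 2-3 each
    # carry their own, independent two-cluster structure.
    labels_a = rng.integers(0, 2, n)        # grouping visible in dims 0-1
    labels_b = rng.integers(0, 2, n)        # different grouping in dims 2-3
    X = rng.normal(scale=0.3, size=(n, 4))
    X[:, 0:2] += labels_a[:, None] * 3.0    # separation only in subspace {0, 1}
    X[:, 2:4] += labels_b[:, None] * 3.0    # separation only in subspace {2, 3}

    for dims, truth in [([0, 1], labels_a), ([2, 3], labels_b)]:
        pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, dims])
        # Agreement with the hidden grouping, up to label permutation.
        agreement = max((pred == truth).mean(), (pred != truth).mean())
        print(f"subspace {dims}: agreement with hidden grouping = {agreement:.2f}")

Clustering the same objects in the two subspaces recovers two different, equally valid groupings, the information a full-space method would blur together.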
In particular, the results of subspace, projected, and correlation clustering provide interesting insights into why objects are clustered together, which is very important for interpretation. For example, correlation clustering of metabolic data can reveal that a specific pattern of linear dependency among metabolites is characteristic of a certain disorder.
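As an illustrative sketch of what such a linear dependency pattern looks like (the synthetic data and the PCA-based check are hypothetical stand-ins, not a specific published correlation clustering algorithm), a group whose points concentrate around a line shows one strongly dominant principal component, while a spherical group does not:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)

    # Hypothetical correlation cluster: points near the line y = 2x, z = -x.
    t = rng.normal(size=(150, 1))
    line_cluster = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(150, 3))

    # A spherical cluster with no linear feature dependency, for contrast.
    blob = rng.normal(size=(150, 3))

    for name, pts in [("correlation cluster", line_cluster), ("spherical blob", blob)]:
        var = PCA().fit(pts).explained_variance_ratio_
        # One dominant component signals a linear dependency pattern.
        print(f"{name}: explained variance ratios = {np.round(var, 3)}")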
For general metric data represented by a similarity matrix, spectral clustering algorithms are very suitable. Selecting relevant information for clustering in this context means learning a suitable similarity measure. Recent approaches propose techniques for automatically adjusting the similarity measure by metric learning in order to improve the cluster structure.
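A minimal sketch of this setting, assuming only pairwise similarities are available (the RBF similarity below is a hypothetical stand-in for a learned measure), using scikit-learn's SpectralClustering with a precomputed affinity matrix:

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons
    from sklearn.metrics import pairwise_distances

    # Hypothetical metric data; in practice only the similarities are given.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # Turn pairwise distances into similarities; an RBF kernel stands in
    # for whatever similarity measure metric learning would produce.
    D = pairwise_distances(X)
    S = np.exp(-(D ** 2) / (2 * 0.1 ** 2))

    labels = SpectralClustering(
        n_clusters=2, affinity="precomputed", random_state=0
    ).fit_predict(S)
    print(np.bincount(labels))  # cluster sizes found from similarities alone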
Semi-supervised approaches to clustering address the second question. These approaches demonstrate that the clustering result can be substantially improved by external side information. This side information is usually obtained from human experts or from other sources of knowledge, such as literature databases. Most algorithms require side information for only very few data objects to achieve a substantial improvement of the clustering result.
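For illustration, such side information is often given as must-link and cannot-link constraints on pairs of objects. The sketch below is a simplified, hypothetical variant in the spirit of constrained k-means, not a specific published implementation: each object is assigned to the nearest center whose cluster does not violate any of its constraints.

    import numpy as np

    def constrained_kmeans(X, k, must_link, cannot_link, n_iter=20, seed=0):
        # Simplified sketch: try clusters in order of distance and skip any
        # cluster that would violate a constraint; objects with no feasible
        # cluster remain unassigned (label -1).
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        labels = np.full(len(X), -1)
        for _ in range(n_iter):
            labels[:] = -1
            for i in range(len(X)):
                dists = ((centers - X[i]) ** 2).sum(axis=1)
                for c in np.argsort(dists):
                    # cannot-link partner already in cluster c -> forbidden
                    bad_cl = any(labels[j] == c
                                 for a, b in cannot_link
                                 for j in ((b,) if a == i else (a,) if b == i else ()))
                    # must-link partner already placed elsewhere -> forbidden
                    bad_ml = any(labels[j] not in (-1, c)
                                 for a, b in must_link
                                 for j in ((b,) if a == i else (a,) if b == i else ()))
                    if not bad_cl and not bad_ml:
                        labels[i] = c
                        break
            # Recompute centers of non-empty clusters.
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        return labels

    # Hypothetical usage: two overlapping blobs plus a few expert-given pairs.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
    labels = constrained_kmeans(X, k=2,
                                must_link=[(0, 1), (60, 61)],
                                cannot_link=[(0, 60)])
    print(np.bincount(labels[labels >= 0]))

Even this handful of constraints biases the assignment step toward the intended grouping, which is why so little side information can go a long way.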
Going beyond data mining, we expect that there will be even more interaction of clustering with other research areas, for example information retrieval, indexing, and parallel and distributed computing.