Novel Trends in Clustering - Evolving Application Domains of Data Warehousing and Mining

Huge volumes of web server log data are generated

every day, and their potential for commercial and

non-commercial applications such as designing

online shops or providing users with personalized

content in digital libraries (Zaiane et al. 1998) is

far from being fully exploited.

In both application scenarios, extraction of

information from the massive amounts of data

is a non-trivial, highly challenging task. In both

scenarios we want to learn unknown regularities

and structure in the data with very little previous

knowledge. In metabolite profiling, we want to

gain novel insights into how certain diseases change

the pattern of metabolites. Simple statistical tests

often applied in biomedicine can provide valuable

information. However, they access only a tiny part of

the information potentially available in the data,

and large parts remain unexplored.

There may be several sub-types of the disease

each associated with a unique pattern of altered

metabolism. Also, in the healthy controls there may

be different types of normal yet unexplored metabolic

patterns. Similarly, in the second scenario we

want to find groups of users with similar behavior

to provide them with personalized content.

As an important area within data mining,

clustering aims at partitioning the data into groups

such that the data objects assigned to a common

group, called a cluster, are as similar as possible and

the objects assigned to different clusters differ as

much as possible. With the term 'data objects'

we denote the instances subjected to a cluster

analysis. Often, data objects can be represented

as feature vectors. In the scenario of metabolite

profiling, the data objects are the subjects. Each

subject is represented by a vector composed of

the amounts of the measured metabolites. The

dimensionality of the resulting feature space equals

the number of metabolites. Alternatively, it could

also be interesting to cluster the metabolites in the

space defined by the subjects with the objective

to identify groups of metabolites having similar

prevalence across subjects.
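
The two views of the data described above can be illustrated with a minimal sketch. The numbers are hypothetical toy values, not measurements, and the names `subjects`, `metabolites` and `euclidean` are chosen here for illustration only. Euclidean distance is assumed as the dissimilarity measure; the text does not prescribe one.

```python
import math

# Hypothetical toy data: 4 subjects (rows) x 3 metabolites (columns).
# The values are invented for illustration only.
subjects = [
    [1.0, 0.2, 5.1],
    [1.1, 0.1, 4.9],
    [3.0, 2.2, 0.5],
    [2.9, 2.0, 0.7],
]

# When clustering subjects, each row is one data object and the
# dimensionality of the feature space equals the number of metabolites.
dimensionality = len(subjects[0])

# When clustering metabolites instead, each metabolite becomes a data
# object described by its amounts across all subjects (the transpose).
metabolites = [list(column) for column in zip(*subjects)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Subjects 0 and 1 have similar profiles, subjects 0 and 2 do not.
d_similar = euclidean(subjects[0], subjects[1])
d_different = euclidean(subjects[0], subjects[2])
print(dimensionality, len(metabolites), d_similar < d_different)
```

A clustering algorithm applied to `subjects` groups people with similar metabolic profiles, while the same algorithm applied to `metabolites` groups metabolites with similar prevalence across subjects.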

Figure 1 displays examples of different types

of clusters in vector data. The simplest type of

a cluster is a spherical Gaussian. An example in

two-dimensional space is depicted in Figure 1(a).

Both coordinates follow a Gaussian distribution

and are statistically independent from each other.

As we will see in the next section, basic clustering

algorithms can reliably detect such clusters.

More complicated are correlation clusters with

orthogonal major directions, as depicted in Figure

1(b). The objects of this cluster follow a line, that is, a

one-dimensional subspace, which is characterized by a

strong linear dependency between the coordinates.

In addition, the major directions of the cluster

are orthogonal and can be detected by Principal

Component Analysis. Figure 1(c) displays a non-linear

correlation cluster. There exists a distinct

dependency between the two coordinates but

this dependency cannot be captured by a linear

Figure 1. Different types of clusters in vector data: (a) spherical Gaussian; (b) correlation cluster; (c)

non-linear correlation cluster; (d) non-Gaussian cluster with non-orthogonal major directions
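
The difference between the cluster types in Figure 1(a) and 1(b) can be made concrete with a small sketch. It generates a synthetic spherical Gaussian and a synthetic correlation cluster in two dimensions and applies Principal Component Analysis through the closed-form eigen-decomposition of the 2x2 covariance matrix. All parameters (sample size, slope, noise level, seed) are assumptions chosen for illustration and do not reproduce the data behind the figure.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def spherical_gaussian(n, sigma=1.0):
    """Both coordinates are independent Gaussians with equal spread."""
    return [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]


def correlation_cluster(n, slope=2.0, noise=0.1):
    """Points scattered tightly around the line y = slope * x."""
    points = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        points.append((x, slope * x + rng.gauss(0.0, noise)))
    return points


def pca_2d(points):
    """Return (angle of major direction, larger and smaller eigenvalue)
    of the sample covariance matrix of two-dimensional points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    center = (cxx + cyy) / 2.0
    radius = math.hypot((cxx - cyy) / 2.0, cxy)
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return angle, center + radius, center - radius


_, sphere_major, sphere_minor = pca_2d(spherical_gaussian(500))
line_angle, line_major, line_minor = pca_2d(correlation_cluster(500))

# Ratio near 1: no preferred direction (spherical cluster).
print("spherical variance ratio:", sphere_major / sphere_minor)
# Very large ratio: nearly all variance lies along one direction.
print("correlation variance ratio:", line_major / line_minor)
# Slope of the major direction, close to the slope used for generation.
print("estimated slope:", math.tan(line_angle))
```

For the spherical cluster the two eigenvalues are of similar size, so no direction stands out. For the correlation cluster almost all variance falls along the major direction, whose slope recovers the linear dependency between the coordinates.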
