Database Reference
Huge volumes of web server log data are generated
every day, and their potential for commercial and
non-commercial applications, such as designing
online shops or providing users with personalized
content in digital libraries (Zaiane et al. 1998), is
far from being fully exploited.
In both application scenarios, extraction of
information from the massive amounts of data
is a non-trivial, highly challenging task. In both
scenarios we want to learn unknown regularities
and structure in the data with very little previous
knowledge. In metabolite profiling, we want to
gain novel insights into how certain diseases change
the pattern of metabolites. Simple statistical tests
often applied in biomedicine can provide valuable
information. However, they access only a tiny part of
the information potentially available in the data,
while large parts remain unexplored.
There may be several sub-types of the disease
each associated with a unique pattern of altered
metabolism. Also in the healthy controls there may
be different types of normal yet unexplored metabolic
patterns. Similarly, in the second scenario we
want to find groups of users with similar behavior
to provide them with personalized content.
As an important area within data mining,
clustering aims at partitioning the data into groups
such that the data objects assigned to a common
group, called a cluster, are as similar as possible and
the objects assigned to different clusters differ as
much as possible. With the term 'data objects'
we denote the instances subjected to a cluster
analysis. Often, data objects can be represented
as feature vectors. In the scenario of metabolite
profiling, the data objects are the subjects. Each
subject is represented by a vector composed of
the amounts of the measured metabolites. The
dimensionality of the resulting feature space equals
the number of metabolites. Alternatively, it could
also be interesting to cluster the metabolites in the
space defined by the subjects with the objective
to identify groups of metabolites having similar
prevalence across subjects.
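
The two views of the data described above can be sketched in a few lines of Python. This is only a minimal illustration: the toy values, the hand-picked cluster labels, and the helper names (euclidean, transpose, mean_pairwise_distances) are assumptions made for this example and do not come from the text. The sketch checks that objects in a common cluster are closer to each other than to objects in other clusters, and shows that clustering metabolites amounts to transposing the data matrix.

```python
import math

# Hypothetical toy data (invented for illustration only):
# rows are subjects, columns are amounts of three measured metabolites.
subjects = [
    [1.0, 5.0, 0.9],  # subject 0
    [1.1, 5.2, 1.0],  # subject 1
    [4.0, 1.0, 4.1],  # subject 2
    [4.2, 0.9, 3.9],  # subject 3
]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transpose(matrix):
    """Swap rows and columns, so that columns become the data objects."""
    return [list(col) for col in zip(*matrix)]


def mean_pairwise_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = euclidean(points[i], points[j])
            (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# View 1: subjects are the data objects; dimensionality = number of metabolites.
# The labels are an assumed partition, not the output of an algorithm.
within, between = mean_pairwise_distances(subjects, labels=[0, 0, 1, 1])

# View 2: metabolites are the data objects; dimensionality = number of subjects.
metabolites = transpose(subjects)
```

For a good partition, the within-cluster value should be clearly smaller than the between-cluster value. Which distance function is appropriate depends on the data; Euclidean distance is used here only because it is the simplest choice.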
Figure 1 displays examples of different types
of clusters in vector data. The simplest type of
a cluster is a spherical Gaussian. An example in
two-dimensional space is depicted in Figure 1(a).
Both coordinates follow a Gaussian distribution
and are statistically independent from each other.
As we will see in the next section, basic clustering
algorithms can reliably detect such clusters.
More complicated are correlation clusters with
orthogonal major directions, as depicted in Figure
1(b). The objects of this cluster follow a line, i.e.
a one-dimensional subspace, which is characterized by a
strong linear dependency between the coordinates.
In addition, the major directions of the cluster
are orthogonal and can be detected by Principal
Component Analysis. Figure 1(c) displays a non-
linear correlation cluster. There exists a distinct
dependency between the two coordinates but
this dependency cannot be captured by a linear
Figure 1. Different types of clusters in vector data: (a) spherical Gaussian; (b) correlation cluster; (c)
non-linear correlation cluster; (d) non-Gaussian cluster with non-orthogonal major directions
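
The difference between cluster types (a) and (b) can be made concrete with a small simulation. The following is a sketch under stated assumptions, not the procedure used to produce Figure 1: the sample size, the noise levels, the slope of 0.5, and the helper names (covariance_2d, principal_variances) are all chosen here for illustration. Principal Component Analysis in two dimensions reduces to the eigen-decomposition of a 2x2 covariance matrix, which has a closed form.

```python
import math
import random


def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) of two-dimensional points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def principal_variances(points):
    """Eigenvalues of the 2x2 covariance matrix, largest first."""
    sxx, sxy, syy = covariance_2d(points)
    mean = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return mean + radius, mean - radius


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# (a) Spherical Gaussian: independent coordinates with equal variance.
spherical = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]

# (b) Correlation cluster: points scattered tightly around the line y = 0.5 * x.
correlated = []
for _ in range(2000):
    t = rng.gauss(0, 3)
    correlated.append((t + rng.gauss(0, 0.1), 0.5 * t + rng.gauss(0, 0.1)))

sph_major, sph_minor = principal_variances(spherical)
cor_major, cor_minor = principal_variances(correlated)

# Slope of the first principal direction of the correlation cluster,
# taken from the eigenvector (sxy, lambda_1 - sxx).
sxx, sxy, _ = covariance_2d(correlated)
major_slope = (cor_major - sxx) / sxy
```

For the spherical cluster the two principal variances should be nearly equal, whereas for the correlation cluster almost all variance lies along one direction, whose slope should recover the assumed linear dependency. A non-linear correlation cluster such as the one in Figure 1(c) would not be summarized well by this kind of linear analysis.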