As there are many different intuitions on how objects can be similar, there exist
many different clustering algorithms that formalize these intuitions and extract
such clusters from data [43-45, 51]. There are two main approaches to clustering.
On the one hand we find so-called partitional algorithms [26, 49, 55, 56], where
similarity of objects is expressed directly as spatial closeness. For example, two
points in Euclidean space are considered relatively similar if the Euclidean distance
between them is smaller than that between other pairs of points. On the other hand
we have density-based approaches [8, 20, 28, 39, 40, 75, 77], where similarity is
expressed in terms of density-connectivity of points. That is, points that lie in a
densely populated area of the data space are said to be 'connected' and should be
assigned to the same cluster, whereas areas of relatively low density separate
different clusters.
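A minimal sketch of the contrast between the two paradigms, assuming scikit-learn is available; the two-moons data set and all parameter values are illustrative choices, not taken from the text:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-circles: the groups are not convex, spatially compact blobs.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Partitional view: k-means assigns each point to the nearest of k centroids,
# i.e. similarity is plain Euclidean closeness to a cluster center.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based view: DBSCAN connects points that lie in densely populated regions;
# sparse areas act as separators between clusters.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN cluster sizes: ", np.bincount(dbscan_labels[dbscan_labels >= 0]))

On such data the density-based method typically recovers the two half-moons, while the partitional method splits the points by proximity to two centroids, illustrating how the two notions of similarity can disagree.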
An important point to note for unsupervised learning in general, and clustering
specifically, is that the cluster structure of the data—and hence that discovered by
a particular clustering algorithm—does not necessarily correlate with class label
annotations: clusters 'simply' identify structure that exists in the data [29, 36].
This means both that clustering requires methods different from those used for
classification, and that we cannot rely on class labels alone to evaluate clusters.
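One common alternative is internal validation, which scores a clustering on the data's own structure rather than on annotations. A brief sketch using the silhouette coefficient; the data set, candidate values of k, and the choice of k-means are assumptions for illustration only:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Compare candidate clusterings without consulting any class annotation.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")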
Over the last 15 years, a lot of research effort has been invested in developing clus-
tering algorithms that can handle high-dimensional data. Compared to traditional
data with only a few attributes, high-dimensional data incur particular challenges,
most prominently the difficulty of assessing the similarity of objects in a mean-
ingful manner. These issues are collectively known as the 'curse of dimensionality'.
Important aspects of this infamous 'curse' and its consequences for clustering (and
related tasks) have been discussed in various studies, surveys, and overview articles
[4, 9, 17, 18, 27, 30, 41, 42, 50, 52, 76, 83, 85].
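One facet of this difficulty can be illustrated with a small sketch, under the assumption of uniformly random data: as the number of attributes grows, the distances from a query point to its nearest and farthest neighbours become nearly indistinguishable, so Euclidean similarity loses discriminative power. Sample sizes and dimensionalities below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))   # 500 points in the d-dimensional unit cube
    q = rng.random(d)          # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}: relative contrast = {contrast:.3f}")

The printed contrast shrinks rapidly with growing d, which is one way the 'curse' manifests itself for distance-based similarity.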
A special family of adaptations of clustering approaches to high-dimensional data
is known as 'subspace clustering'. Here the idea is that clusters do not necessarily
exhibit similarity over all attributes, but that their similarity may be restricted to
subsets of attributes; the other attributes are not relevant to the cluster structure. In
effect, there is a need for algorithms that can measure similarity of objects, and hence
detect clusters, over subspaces. Different subspaces can be relevant for different
clusters, while the clusters can be obscured by the noise of the remaining, 'irrelevant'
attributes. This problem setting bears many similarities to that of mining
frequent patterns, and in fact algorithmic ideas originally developed for frequent
pattern mining form the foundations of the subspace clustering paradigm [7].
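The following toy sketch illustrates the setting (it is not an actual subspace clustering algorithm): a clear two-cluster structure lives in two 'relevant' attributes and is hidden by many irrelevant, noisy dimensions, so a full-space clustering scores poorly while the same method applied to the relevant subspace recovers the structure. The data generation and all parameters are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n = 200

# Two well-separated groups in the 'relevant' attributes ...
relevant = np.vstack([rng.normal(0.0, 0.5, (n, 2)),
                      rng.normal(5.0, 0.5, (n, 2))])
# ... drowned in 20 'irrelevant' attributes of pure noise.
irrelevant = rng.normal(0.0, 5.0, (2 * n, 20))
X = np.hstack([relevant, irrelevant])

for name, view in [("full space", X), ("relevant subspace", X[:, :2])]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(view)
    print(f"{name:18s}: silhouette = {silhouette_score(view, labels):.3f}")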
As in pattern mining, the general intuition in subspace clustering is that an object
may be a member of several clusters, over different subsets of the attributes. In
this manner, it is possible to group the data differently depending on the features
that are considered. Figure 16.1 gives an example. As we can see, the projection to
different subspaces results in different clusters, but not all dimensions contribute to
the patterns. In the leftmost projection to the subspace consisting of dimensions x
and y, two groups are visible that are different from the groups seen in the center
projection to dimensions w and z (note that symbols are consistent across the projec-
tions shown). Interestingly, the subspace spanned by y and z does not show any clear
subspace clusters. The key observation here is that this view of different aspects of the