As there are many different intuitions on how objects can be similar, there exist
many different clustering algorithms that formalize these intuitions and extract
such clusters from data [43-45, 51]. There are two main approaches to clustering.
On the one hand we find so-called partitional algorithms [26, 49, 55, 56], where
similarity of objects is expressed directly as spatial closeness. For example, two
points in Euclidean space are considered relatively similar if the Euclidean distance
between them is smaller than that between other pairs of points. On the other hand
we have density-based approaches [8, 20, 28, 39, 40, 75, 77], where similarity is
expressed in terms of density-connectivity of points. That is, points that lie in a
densely populated area of the data space are said to be 'connected' and should be
assigned to the same cluster, whereas areas of relatively low density separate
different clusters.
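A minimal sketch of the contrast between the two paradigms, assuming scikit-learn is available; the two-moons data set and all parameter values are illustrative choices, not taken from the text:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-circles: the groups are not convex, spatially compact blobs.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Partitional view: k-means assigns each point to the nearest of k centroids,
# i.e. similarity is plain Euclidean closeness to a cluster center.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based view: DBSCAN connects points that lie in densely populated regions;
# sparse areas act as separators between clusters.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN cluster sizes: ", np.bincount(dbscan_labels[dbscan_labels >= 0]))

On such data the density-based method typically recovers the two half-moons, while the partitional method splits the points by proximity to two centroids, illustrating how the two notions of similarity can disagree.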
An important point to note for unsupervised learning in general, and clustering
specifically, is that the cluster structure of the data—and hence that discovered by
a particular clustering algorithm—does not necessarily correlate with class label
annotations: clusters 'simply' identify structure that exists in the data [29, 36].
This means both that clustering requires methods different from those used for
classification, and that we cannot rely on class labels alone to evaluate clusters.
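One common alternative is internal validation, which scores a clustering on the data's own structure rather than on annotations. A brief sketch using the silhouette coefficient; the data set, candidate values of k, and the choice of k-means are assumptions for illustration only:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Compare candidate clusterings without consulting any class annotation.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")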
Over the last 15 years, a lot of research effort has been invested in developing clus-
tering algorithms that can handle high-dimensional data. Compared to traditional
data with only a few attributes, high-dimensional data incur particular challenges,
most prominently the difficulty of assessing the similarity of objects in a mean-
ingful manner. These issues are collectively known as the 'curse of dimensionality'.
Important aspects of this infamous 'curse' and its consequences for clustering (and
related tasks) have been discussed in various studies, surveys, and overview articles
[4, 9, 17, 18, 27, 30, 41, 42, 50, 52, 76, 83, 85].
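One facet of this difficulty can be illustrated with a small sketch, under the assumption of uniformly random data: as the number of attributes grows, the distances from a query point to its nearest and farthest neighbours become nearly indistinguishable, so Euclidean similarity loses discriminative power. Sample sizes and dimensionalities below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))   # 500 points in the d-dimensional unit cube
    q = rng.random(d)          # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}: relative contrast = {contrast:.3f}")

The printed contrast shrinks rapidly with growing d, which is one way the 'curse' manifests itself for distance-based similarity.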
A special family of adaptations of clustering approaches to high-dimensional data
is known as 'subspace clustering'. Here the idea is that clusters do not necessarily
exhibit similarity over all attributes, but that their similarity may be restricted to
subsets of attributes; the other attributes are not relevant to the cluster structure. In
effect, there is a need for algorithms that can measure similarity of objects, and hence
detect clusters, over subspaces. Different subspaces can be relevant for different
clusters, while the clusters can be obscured by the noise of the remaining, 'irrelevant'
attributes. This problem setting bears many similarities to that of mining
frequent patterns, and in fact algorithmic ideas originally developed for frequent
pattern mining form the foundations of the subspace clustering paradigm [7].
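The following toy sketch illustrates the setting (it is not an actual subspace clustering algorithm): a clear two-cluster structure lives in two 'relevant' attributes and is hidden by many irrelevant, noisy dimensions, so a full-space clustering scores poorly while the same method applied to the relevant subspace recovers the structure. The data generation and all parameters are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n = 200

# Two well-separated groups in the 'relevant' attributes ...
relevant = np.vstack([rng.normal(0.0, 0.5, (n, 2)),
                      rng.normal(5.0, 0.5, (n, 2))])
# ... drowned in 20 'irrelevant' attributes of pure noise.
irrelevant = rng.normal(0.0, 5.0, (2 * n, 20))
X = np.hstack([relevant, irrelevant])

for name, view in [("full space", X), ("relevant subspace", X[:, :2])]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(view)
    print(f"{name:18s}: silhouette = {silhouette_score(view, labels):.3f}")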
As in pattern mining, the general intuition in subspace clustering is that an object
may be a member of several clusters, over different subsets of the attributes. In
this manner, it is possible to group the data differently depending on the features
that are considered. Figure 16.1 gives an example. As we can see, the projection to
different subspaces results in different clusters, but not all dimensions contribute to
the patterns. In the leftmost projection to the subspace consisting of dimensions x
and y, two groups are visible that are different from the groups seen in the center
projection to dimensions w and z (note that symbols are consistent across the projec-
tions shown). Interestingly, the subspace spanned by y and z does not show any clear
subspace clusters. The key observation here is that this view of different aspects of the