Multiview clustering [19, 22, 34, 46] approaches the problem from the opposite direction. It is based on the notion of semantically different subspaces, i.e., multiple representations for the same data. Since we cannot generally assume to know the different semantics of subspaces beforehand, results may accordingly be found in overlapping subspaces. As a consequence, these approaches allow some redundancy between the resulting clusters: a certain partial overlap between concepts is tolerated in order not to exclude possibly interesting concepts.
A related though distinct way of addressing the problem of redundancy and distinctiveness of different clusters is to seek diverse clusterings by directly assessing a certain notion of distance between different partitions (so-called alternative clustering approaches [14, 23, 24, 25, 32, 33, 35, 70, 72]). Starting with one clustering solution, these methods search for an alternative clustering solution that provides substantially different insights. Still, alternative clustering solutions need not be completely orthogonal to existing ones but are allowed to show some redundancy with them.
Apparently, avoiding redundancy, as the more 'enhanced' [74] subspace clustering algorithms try to do, should not be pursued as an absolute goal. Multiview clustering and alternative clustering come from the other extreme and relax the original restriction of 'no redundancy' more and more. Relationships between subspace clustering and other families of clustering approaches have been discussed by Zimek and Vreeken [84].
A question related to the redundancy issue is that of the appropriate density level. Both of these issues have a decisive influence on the clusters that are selected. Determining the right density level is a general problem also in full-space density-based clustering [51], but for clustering in subspaces, the problem is even more severe. Setting a fixed density threshold for an Apriori-style subspace search is not appropriate for all possible subspaces. Consider, for example, any CLIQUE-style grid approach: the volume of a hypercube increases exponentially with the dimensionality, hence the density decreases rapidly. As a consequence, any chosen threshold introduces a bias toward identifying clusters of (up to) a certain dimensionality. This observation motivates research on adaptive density thresholds [10, 62]. The algorithmic challenge then comes from the loss of the monotonicity property that would otherwise allow efficient traversal of the search space of subspaces.
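The bias of a fixed density threshold in a grid-based search can be illustrated with a small calculation (a sketch only; the grid resolution xi, data size n, and threshold tau are hypothetical values chosen for illustration, not taken from any of the cited algorithms):

```python
# Sketch: expected cell density in a CLIQUE-style grid under a uniform
# distribution. With xi bins per dimension, a d-dimensional subspace
# has xi**d cells, so the expected number of points per cell shrinks
# by a factor of xi with every added dimension.
def expected_points_per_cell(n, xi, d):
    return n / (xi ** d)

n, xi, tau = 1_000_000, 10, 5  # hypothetical data size, resolution, threshold
for d in range(1, 9):
    dens = expected_points_per_cell(n, xi, d)
    print(f"d={d}: {dens:g} points per cell on average")
```

With these hypothetical settings, the expectation already falls below the threshold tau = 5 for d >= 6, so a fixed threshold systematically favors low-dimensional subspaces, which is exactly the bias that adaptive density thresholds [10, 62] aim to remove.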
When using the Euclidean distance (L2), the appropriate choice of an ε-range becomes extremely challenging as well, due to the rather counter-intuitive behavior of the volume of the hypersphere with increasing dimensions. Let us note that, for outlier detection, the very same problem occurs in high-dimensional data, as has been discussed in detail by Zimek et al. [85]. Choosing the size of the neighborhood in terms of objects rather than in terms of a radius (i.e., using the k nearest neighbors instead of an ε-range query) has been advocated as a workaround for this problem [2], as it solves at least certain aspects, such as guaranteeing a well-defined (non-empty) set of objects for the density estimation, or controlling spatial properties of the neighborhood.
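The counter-intuitive behavior of the hypersphere volume can be checked directly from the standard closed-form expression V_d(ε) = π^(d/2) / Γ(d/2 + 1) · ε^d (a minimal sketch; the fixed radius ε = 1 is an arbitrary choice for illustration):

```python
import math

# Volume of a d-dimensional ball with radius eps:
# V_d(eps) = pi**(d/2) / Gamma(d/2 + 1) * eps**d
def ball_volume(d, eps):
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * eps ** d

# For a fixed radius, the volume first grows with d, peaks around d = 5,
# and then vanishes: an eps-range that is generous in 2 dimensions
# encloses almost no volume (and thus, typically, no objects) in 50.
for d in (1, 2, 5, 10, 20, 50):
    print(f"d={d}: V={ball_volume(d, 1.0):.3e}")
```

A k-nearest-neighbor query sidesteps the radius choice entirely: its neighborhood always contains exactly k objects regardless of the dimensionality, which is precisely the property exploited by the workaround discussed above [2].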