This holds, e.g., for L_p-type distances (p ≥ 1, for example the commonly used
Euclidean distance), because distances between the points can never shrink when
adding more dimensions. Let us note that this is also the reason why the core point
property is anti-monotone (cf. Eq. 16.6).
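This monotonicity of L_p distances under added dimensions, and hence the anti-monotone behavior of neighborhood-based properties such as the core point property, can be illustrated with a small sketch (the toy points and the threshold eps = 1.0 are illustrative assumptions):

```python
import math

def dist(p, q, dims):
    """Euclidean distance restricted to the given subspace dimensions."""
    return math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))

p = (1.0, 2.0, 9.0)
q = (1.5, 2.5, 0.0)

# A distance in a subspace never exceeds the distance in any superspace,
# so an eps-neighborhood can only lose points when dimensions are added.
d_sub = dist(p, q, [0, 1])       # subspace {0, 1}
d_super = dist(p, q, [0, 1, 2])  # superspace {0, 1, 2}
assert d_sub <= d_super

eps = 1.0
# q lies in p's eps-neighborhood in the subspace, but not in the superspace:
print(d_sub <= eps, d_super <= eps)  # → True False
```

Consequently, if a point fails the core point condition in some subspace, it must also fail it in every superspace of that subspace.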
In a similar way, CFPC [80] ('MineClus' in an earlier version [79]) improves the
subspace search strategy of an earlier projected clustering algorithm, DOC [71],
by means of a frequent pattern mining-based approach. As projected clustering
approaches, both pursue a local subspace search per (preliminary) cluster. A typical
projected clustering algorithm, following the seminal approach of PROCLUS [3], starts with some
initial assignment of points to clusters. Then, the optimal projection (subspace) of
each cluster and the assignment of points are iteratively refined. In DOC, random
sampling was applied to find the most suitable subspace for a potential cluster. CFPC
replaces this random sampling strategy by a technique related to FP-growth. A potential
cluster is defined by its potential (in both approaches randomly sampled) medoid
p. For all points q, an itemset includes those dimensions in which q is close to p. A
large, frequent itemset would therefore correspond to a projected cluster with many
points and high dimensionality. To find the best cluster and its optimal projection,
FP-growth is applied over this modelling of frequent itemsets.
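This itemset modelling can be sketched as follows; the width threshold w, the toy points, and the brute-force support counting (standing in for the FP-growth-related technique that CFPC actually employs) are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

def dimension_itemsets(points, medoid, w):
    """For each point q, build the 'itemset' of dimensions in which q is
    within width w of the medoid p (the DOC/CFPC modelling)."""
    return [frozenset(d for d in range(len(medoid))
                      if abs(q[d] - medoid[d]) <= w)
            for q in points]

# Hypothetical toy data: medoid p and points agreeing with p in dims 0 and 1.
p = (0.0, 0.0, 0.0)
points = [(0.1, -0.2, 5.0), (-0.3, 0.1, -7.0), (0.2, 0.3, 9.0), (4.0, 0.1, 0.2)]
itemsets = dimension_itemsets(points, p, w=1.0)

# Count support of candidate subspaces (brute force instead of FP-growth):
support = Counter()
for items in itemsets:
    for k in range(1, len(items) + 1):
        for sub in combinations(sorted(items), k):
            support[sub] += 1

# A large, frequent itemset such as (0, 1) corresponds to a projected
# cluster with many points and high dimensionality.
print(support[(0, 1)])  # → 3
```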
The projected clustering algorithm P3C [58, 59] also incorporates an Apriori-like
local subspace search, but in yet another variant. The basic idea of P3C is to find
cluster cores starting with "p-signatures", which are intervals of some subset of p
distinct attributes, i.e., subspace regions. Roughly, such a p-signature qualifies as a
cluster core if and only if its support, i.e., the number of points falling into this
subspace region, exceeds the expected support under some assumptions concerning
the point distribution, and the probability of this deviation occurring by chance
(a Poisson probability) is lower than some specified (Poisson) threshold. By these
conditions, p-signatures qualifying as cluster cores can be generated using an
Apriori-like candidate elimination procedure.
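The support test behind cluster cores might be sketched roughly as follows; the uniform null distribution, the stdlib Poisson tail computation, and the threshold alpha are simplifying assumptions for illustration, not P3C's exact statistics:

```python
import math

def poisson_upper_tail(lam, k):
    """P[X >= k] for X ~ Poisson(lam), via the complementary lower tail."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def qualifies_as_cluster_core(observed_support, n_points, region_volume,
                              total_volume, alpha=1e-3):
    """Sketch of a P3C-style test: under a uniform null hypothesis, the
    expected support of a p-signature is proportional to its volume; the
    signature qualifies if its observed support exceeds this expectation
    AND is that extreme with Poisson probability below alpha."""
    expected = n_points * region_volume / total_volume
    if observed_support <= expected:
        return False
    return poisson_upper_tail(expected, observed_support) < alpha

# Hypothetical numbers: 1000 points, region covering 1 % of the space,
# so 10 points are expected; observing 40 is highly unlikely by chance.
print(qualifies_as_cluster_core(40, 1000, 0.01, 1.0))  # → True
```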
3.3 Redundancy in Subspace Clustering
As pointed out above, redundancy of subspace cluster results is a problem inherited
from the Apriori strategy for traversing the search space of subspaces. As a
consequence, reducing redundancy is a major topic in current research on subspace
clustering. As we have seen, the concept of borders already found analogous use in
the early subspace search algorithm ENCLUS [21] for restricting the search space.
Some approaches mine or report the most representative clusters as solutions [13, 61].
This is related to picking or creating a number of representative results in frequent
pattern mining. The idea of restricting the results of frequent pattern mining to the
maximal frequent itemsets has also found a correspondence in subspace clustering.
For example, nCluster [54], CLICKS [81], and MaPle [69] mine those subspace
clusters of maximal dimensionality.
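A simple analogue of this maximality filter, under the assumption that redundancy is judged by subspace and point-set containment (a deliberate simplification of what these algorithms actually do), could look like this:

```python
def maximal_subspace_clusters(clusters):
    """Keep only clusters whose subspace is not a proper subset of the
    subspace of another cluster covering at least the same points -- a
    simple analogue of keeping only maximal frequent itemsets."""
    maximal = []
    for dims, pts in clusters:
        # A cluster is redundant if a higher-dimensional cluster exists
        # whose subspace strictly contains dims and whose points cover pts.
        dominated = any(dims < other_dims and pts <= other_pts
                        for other_dims, other_pts in clusters)
        if not dominated:
            maximal.append((dims, pts))
    return maximal

# Hypothetical clusters as (subspace dimensions, member point ids):
clusters = [
    (frozenset({0}), frozenset({1, 2, 3})),
    (frozenset({0, 1}), frozenset({1, 2, 3})),  # makes the first redundant
    (frozenset({2}), frozenset({4, 5})),
]
print(maximal_subspace_clusters(clusters))
```

Here the one-dimensional cluster in subspace {0} is dropped because the cluster in {0, 1} covers the same points in a strictly larger subspace.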
Other variants of clustering algorithms outside subspace clustering that also tackle
high-dimensional data face a similar problem. For example, multiview clustering