This holds, e.g., for L_p-type distances (p ≥ 1, for example the commonly used
Euclidean distance), because distances between the points can never shrink when
adding more dimensions. Let us note that this is also the reason why the core point
property is anti-monotone (cf. Eq. 16.6).
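This monotonicity of L_p distances under added dimensions, and hence the anti-monotone behavior of neighborhood-based properties such as the core point property, can be illustrated with a small sketch (the toy points and the threshold eps = 1.0 are illustrative assumptions):

```python
import math

def dist(p, q, dims):
    """Euclidean distance restricted to the given subspace dimensions."""
    return math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))

p = (1.0, 2.0, 9.0)
q = (1.5, 2.5, 0.0)

# A distance in a subspace never exceeds the distance in any superspace,
# so an eps-neighborhood can only lose points when dimensions are added.
d_sub = dist(p, q, [0, 1])       # subspace {0, 1}
d_super = dist(p, q, [0, 1, 2])  # superspace {0, 1, 2}
assert d_sub <= d_super

eps = 1.0
# q lies in p's eps-neighborhood in the subspace, but not in the superspace:
print(d_sub <= eps, d_super <= eps)  # → True False
```

Consequently, if a point fails the core point condition in some subspace, it must also fail it in every superspace of that subspace.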
In a similar way, CFPC [80] ('MineClus' in an earlier version [79]) improves the
subspace search strategy of an earlier projected clustering algorithm, DOC [71],
by means of a frequent pattern mining-based approach. As projected clustering
approaches, both pursue a local subspace search per (preliminary) cluster. A typical
projected clustering algorithm, following the seminal approach of PROCLUS [3], starts with some
initial assignment of points to clusters. Then, the optimal projection (subspace) of
each cluster and the assignment of points are iteratively refined. In DOC, random
sampling was applied to find the most suitable subspace for a potential cluster. CFPC
replaces this random sampling strategy by a technique related to FP-growth. A potential
cluster is defined by its potential (in both approaches randomly sampled) medoid
p. For all points q, an itemset includes those dimensions in which q is close to p. A
large, frequent itemset would therefore correspond to a projected cluster with many
points and high dimensionality. To find the best cluster and its optimal projection,
FP-growth is applied over this modelling of frequent itemsets.
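This itemset modelling can be sketched as follows; the width threshold w, the toy points, and the brute-force support counting (standing in for the FP-growth-related technique that CFPC actually employs) are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

def dimension_itemsets(points, medoid, w):
    """For each point q, build the 'itemset' of dimensions in which q is
    within width w of the medoid p (the DOC/CFPC modelling)."""
    return [frozenset(d for d in range(len(medoid))
                      if abs(q[d] - medoid[d]) <= w)
            for q in points]

# Hypothetical toy data: medoid p and points agreeing with p in dims 0 and 1.
p = (0.0, 0.0, 0.0)
points = [(0.1, -0.2, 5.0), (-0.3, 0.1, -7.0), (0.2, 0.3, 9.0), (4.0, 0.1, 0.2)]
itemsets = dimension_itemsets(points, p, w=1.0)

# Count support of candidate subspaces (brute force instead of FP-growth):
support = Counter()
for items in itemsets:
    for k in range(1, len(items) + 1):
        for sub in combinations(sorted(items), k):
            support[sub] += 1

# A large, frequent itemset such as (0, 1) corresponds to a projected
# cluster with many points and high dimensionality.
print(support[(0, 1)])  # → 3
```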
The projected clustering algorithm P3C [58, 59] also incorporates an Apriori-like
local subspace search, but in yet another variant. The basic idea of P3C is to find
cluster cores starting with "p-signatures", which are intervals of some subset of p
distinct attributes, i.e., subspace regions. Roughly, such a p-signature qualifies as a
cluster core if and only if its support, i.e., the number of points falling into this
subspace region, exceeds the expected support under some assumptions concerning
the point distribution, and the probability of this deviation occurring by chance
(a Poisson probability) is lower than some specified (Poisson) threshold. By these
conditions, p-signatures qualifying as cluster cores can be generated using an
Apriori-like candidate elimination procedure.
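The support test behind cluster cores might be sketched roughly as follows; the uniform null distribution, the stdlib Poisson tail computation, and the threshold alpha are simplifying assumptions for illustration, not P3C's exact statistics:

```python
import math

def poisson_upper_tail(lam, k):
    """P[X >= k] for X ~ Poisson(lam), via the complementary lower tail."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def qualifies_as_cluster_core(observed_support, n_points, region_volume,
                              total_volume, alpha=1e-3):
    """Sketch of a P3C-style test: under a uniform null hypothesis, the
    expected support of a p-signature is proportional to its volume; the
    signature qualifies if its observed support exceeds this expectation
    AND is that extreme with Poisson probability below alpha."""
    expected = n_points * region_volume / total_volume
    if observed_support <= expected:
        return False
    return poisson_upper_tail(expected, observed_support) < alpha

# Hypothetical numbers: 1000 points, region covering 1 % of the space,
# so 10 points are expected; observing 40 is highly unlikely by chance.
print(qualifies_as_cluster_core(40, 1000, 0.01, 1.0))  # → True
```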
3.3 Redundancy in Subspace Clustering
As pointed out above, redundancy of subspace cluster results is a problem inherited
from the Apriori strategy for traversing the search space of subspaces. As a
consequence, reducing redundancy is a major topic in current research on subspace
clustering. As we have seen, the concept of borders already found analogous use in
the early subspace search algorithm ENCLUS [21] for restricting the search space.
Some approaches mine or report the most representative clusters as solutions [13, 61].
This is related to picking or creating a number of representative results in frequent
pattern mining. The idea of restricting the results of frequent pattern mining to the
maximal frequent itemsets has also found a correspondence in subspace clustering.
For example, nCluster [54], CLICKS [81], and MaPle [69] mine those subspace
clusters of maximal dimensionality.
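A simple analogue of this maximality filter, under the assumption that redundancy is judged by subspace and point-set containment (a deliberate simplification of what these algorithms actually do), could look like this:

```python
def maximal_subspace_clusters(clusters):
    """Keep only clusters whose subspace is not a proper subset of the
    subspace of another cluster covering at least the same points -- a
    simple analogue of keeping only maximal frequent itemsets."""
    maximal = []
    for dims, pts in clusters:
        # A cluster is redundant if a higher-dimensional cluster exists
        # whose subspace strictly contains dims and whose points cover pts.
        dominated = any(dims < other_dims and pts <= other_pts
                        for other_dims, other_pts in clusters)
        if not dominated:
            maximal.append((dims, pts))
    return maximal

# Hypothetical clusters as (subspace dimensions, member point ids):
clusters = [
    (frozenset({0}), frozenset({1, 2, 3})),
    (frozenset({0, 1}), frozenset({1, 2, 3})),  # makes the first redundant
    (frozenset({2}), frozenset({4, 5})),
]
print(maximal_subspace_clusters(clusters))
```

Here the one-dimensional cluster in subspace {0} is dropped because the cluster in {0, 1} covers the same points in a strictly larger subspace.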
Other variants of clustering algorithms outside subspace clustering that also tackle
high-dimensional data face a similar problem. For example, multiview clustering