Database Reference
In-Depth Information
3
Frequent Pattern Mining in Subspace Clustering
3.1
Subspace Cluster Search
As mentioned above, CLIQUE [ 7 ] introduced the first subspace clustering algorithm
using a monotonicity on the subspace search space. The approach uses an equal width
discretization of the input space and a density threshold per cell. A subspace cluster
is a maximal set of connected dense cells in some subspace. As a consequence, the
approach operates also algorithmically at the cell level. The monotonicity used is
that a dense cell in a k -dimensional subspace is also a dense cell in all its k
1
dimensional subspaces:
C
is a cluster in subspace
T
(16.4)
C
is part of a cluster in all subspaces
S T
Based on the corresponding anti-monotonicity, Apriori is applied from 1-dimensional
dense cells in a straightforward fashion to find all higher-dimensional dense cells. As
a variation of this base scheme, an approximation is suggested that prunes subspaces
from consideration if their dense cells do not cover a sufficiently large part of the
data.
MAFIA [ 65 ] extends the cell-based approach by adapting the cell sizes to the data
distribution. The general approach is to combine neighboring cells in one dimension
if they have similar density values. The monotonicity used is the same as in CLIQUE,
but additionally, a parallel algorithm is introduced that processes chunks of the data
on local machines that communicate to exchange cell counts at each level of the
subspace lattice. XProj [ 5 ] is an adaptation of the CLIQUE idea to clustering of
graph data based on frequent sub-graphs and was applied to cluster XML data. In
contrast to CLIQUE, XProj looks for a hard partitioning, rather than overlapping
clusters.
CLIQUE and MAFIA may miss points or subspace clusters depending on location
and resolution of the cells (see for example Fig. 16.6 ), so later works have proposed
bottom-up algorithms that do not rely on discretization. SUBCLU [ 48 ] follows the
density-based subspace clustering paradigm. As already illustrated in Fig. 16.4 , sub-
space clusters are maximal sets of density-connected points. Any subspace cluster
projection to a lower dimensional subspace is a density-connected set again (albeit
not necessarily a maximal one). Anti-monotonicity is used in that if a subspace does
not contain a density-based subspace cluster, then no superspace will either.
Note that this approach means that the notion of frequent patterns is also different
than in CLIQUE and MAFIA: in these cell-based approaches, a (frequent) item is a
(dense) cell in a particular subspace, whereas in SUBCLU (and later approaches) it is
the entire subspace. In SUBCLU, the Apriori principle is used to generate candidate
subspaces within which the actual subspace clusters are determined.
The DUSC [ 10 ] approach relies on a different definition of density than SUBCLU
does. Based on the observation that a fixed density assessment is biased and favors
Search WWH ::




Custom Search