Database Reference
In-Depth Information
2
1
0
0
1
2
3
x dimension
Fig. 16.6 Standard grid-based discretization as used e.g. in CLIQUE: the accuracy of subspace
clustering depends on location and resolution of the grid. A minimum cell count of more than three
will miss the subspace cluster at the bottom right , whereas a minimum cell count of three will also
report cells that contain a few isolated noise points (e.g., cell at the center right )
low dimensional subspace clusters over high-dimensional ones, the density measure
is normalized by the expected density. This means that (anti-)monotonicity is lost,
and standard application of Apriori is not possible. However, as proposed in a later
extension, it is possible to use the anti-monotonicity as a filtering criterion in a
multistep clustering scheme (EDSC) [ 11 ]. The idea is to generate a conservative
approximation of subspace clusters based on cells that are merged if potentially
density-connected. Similar in spirit to the anti-monotonicity in Apriori, pruning is
based on the weakest density measure as a filter step.
The idea of avoiding full lattice search in favor of more efficient runtimes (i.e.,
the colossal pattern idea [ 82 ] we saw above) is also found for subspace clustering
[ 64 ]. Instead of analyzing all subspaces, and the entire value ranges within these
subspaces, the idea is to represent subspace clusters at different levels of approxima-
tion. Using the number of objects within the current approximation as an indication,
potential combinations with other subspaces are used as an indication of higher-
dimensional subspace clusters. Priority queues are maintained in order to generate
the most promising candidates in the lattice first. As a result, it becomes possible
to avoid the generation of many relatively low-dimensional subspace clusters and to
steer the search towards high-dimensional subspace clusters directly.
Another interesting connection to frequent pattern mining is discussed with the
algorithm INSCY for density-based subspace clustering [ 12 ]: subspace clusters are
detected based on a frequent cell count data representation, an index structure that
is similar in spirit to the FP-tree from frequent itemset mining. As mentioned in
the previous section, the challenge here is two-fold: first, to define an adequate
representation of subspace regions (the items), and second, to identify similarities
among these subspace regions. For the first part, a discretization technique as in
 
Search WWH ::




Custom Search