[Fig. 16.8 image: SCY-tree with cell-count nodes per dimension, y dimension (0:10, 2:4, 1:13) and x dimension (0:2, 1:5, 2:4, 3:2), plus marker nodes '1' and '2']
Fig. 16.8 SCY-tree index for depth-first mining of subspace clusters. Nodes contain cell counts as
in frequent itemset mining. Levels correspond to different dimensions, and additional marker nodes
indicate that a border is non-empty and that cells need to be merged during mining. For example,
the gray shaded node labeled '2' at the bottom corresponds to the non-empty border of cell '2' in
dimension x in Fig. 16.7
algorithm yet independent of the cluster model. In the first scenario, we can regard
subspace search as a global identification of 'interesting' subspaces—subspaces in
which we expect clusters to exist—and hence as a restriction of the search space. In
the second scenario, we observe a local identification of 'interesting' subspaces. A
typical use case of these 'locally interesting' subspaces is to adapt distance measures
locally, that is, for different clusters, different measures of similarity are applied.
Global Subspace Search ENCLUS [ 21 ] is based on an assessment of the subspace
as a whole, i.e., a subspace search step precedes the actual subspace clustering. In
order to determine interesting subspaces, Shannon Entropy [ 73 ] is used. Entropy
measures the uncertainty in a random variable, where a high value means a high
level of uncertainty. A uniform distribution implies greatest uncertainty, so a low
entropy value (below some threshold) is used as an indication of subspace clusters.
Similar to CLIQUE, the data are discretized into equal-width cells before entropy
assessment. Monotonicity is based on the fact that an additional attribute can only
increase the uncertainty and thereby the Shannon Entropy:
T has low entropy ⇒ ∀ S ⊆ T : S has low entropy.     (16.5)
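The monotonicity in (16.5) can be checked numerically. The sketch below (the function name and bin count are my own choices, not part of ENCLUS) discretizes the data into equal-width grid cells, computes the Shannon entropy of the cell-count distribution, and verifies that a projection never has higher entropy than its superspace:

```python
import numpy as np

def subspace_entropy(data, dims, bins=4):
    """Shannon entropy (bits) of the data projected onto `dims`,
    after equal-width discretization into `bins` cells per dimension."""
    proj = data[:, dims]
    cell_ids = []
    for i in range(proj.shape[1]):
        edges = np.linspace(proj[:, i].min(), proj[:, i].max(), bins + 1)
        # Use interior edges only, so cell indices fall in 0..bins-1.
        cell_ids.append(np.digitize(proj[:, i], edges[1:-1]))
    # One flat cell id per point, then the empirical cell distribution.
    flat = np.ravel_multi_index(cell_ids, (bins,) * proj.shape[1])
    _, counts = np.unique(flat, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

h_x = subspace_entropy(data, [0])
h_xy = subspace_entropy(data, [0, 1])
# Monotonicity (16.5): the subspace has at most the entropy of its superspace.
assert h_x <= h_xy
```

Because the grid edges for each dimension are computed independently, the one-dimensional cell distribution is exactly the marginal of the joint cell distribution, which is what makes the inequality hold.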
Besides this Apriori bottom-up part of the algorithm, an additional mutual infor-
mation criterion is used for top-down pruning. Interesting subspaces in this sense are
those with an entropy that is lower (by some threshold) than the sum of the entropy
of each of its one-dimensional subspaces. Using both criteria, the most interesting
subspaces for subspace clustering according to ENCLUS are located neither at the
top nor at the bottom of the subspace search space, but at some medium dimension-
ality. This resembles the concept of borders in frequent itemset mining (Sect. 2.3 ).
While borders there are used to derive a condensed representation of the result set,
here the result set is restricted in order to reduce the redundancy of too many clusters.
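The top-down criterion can be sketched as follows; the helper and function names are illustrative, and the grid discretization mirrors the equal-width cells described above. The interest of a subspace is the sum of its one-dimensional entropies minus its joint entropy, which is large exactly when the dimensions are correlated:

```python
import numpy as np

def grid_entropy(proj, bins=4):
    """Shannon entropy (bits) over an equal-width grid on `proj`."""
    ids = [np.digitize(proj[:, i],
                       np.linspace(proj[:, i].min(), proj[:, i].max(),
                                   bins + 1)[1:-1])
           for i in range(proj.shape[1])]
    _, counts = np.unique(np.ravel_multi_index(ids, (bins,) * proj.shape[1]),
                          return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def interest(data, dims, bins=4):
    """ENCLUS-style top-down criterion: sum of one-dimensional entropies
    minus the joint entropy of the subspace."""
    return (sum(grid_entropy(data[:, [d]], bins) for d in dims)
            - grid_entropy(data[:, dims], bins))

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
data = np.column_stack([x,                                # dim 0
                        x + 0.1 * rng.normal(size=2000),  # dim 1: correlated
                        rng.normal(size=2000)])           # dim 2: independent
# Correlated dimensions yield a clearly higher interest value.
assert interest(data, [0, 1]) > interest(data, [0, 2])
```

A subspace passes the top-down test when this value exceeds some threshold, complementing the low-entropy requirement of the bottom-up part.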
For RIS (Ranking Interesting Subspaces) [ 47 ], subspaces are 'interesting' if they
have a large number of points in the neighborhoods of core points (i.e., points with a
high local point density according to some thresholds), normalized by the expected
number of points assuming uniform distribution. While this criterion adopts a density-
based notion [ 51 ] of 'interesting', it is not tied to a specific clustering algorithm.
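A minimal sketch of such a density-based quality measure is given below. This is my simplification, not the exact measure of RIS [47]: it sums neighborhood counts over core points in the subspace and normalizes by the count expected if the points were uniformly distributed over the subspace's bounding box.

```python
import numpy as np
from math import gamma, pi

def ris_quality(data, dims, eps=0.1, min_pts=10):
    """Sketch of an RIS-style subspace quality: total neighborhood count
    over core points, normalized by its expectation under uniformity."""
    proj = data[:, dims]
    n, k = proj.shape
    # Pairwise distances in the subspace (O(n^2); fine for a sketch).
    dist = np.linalg.norm(proj[:, None, :] - proj[None, :, :], axis=-1)
    neigh = (dist <= eps).sum(axis=1)   # includes the point itself
    core = neigh >= min_pts             # core points: dense neighborhoods
    if not core.any():
        return 0.0
    observed = neigh[core].sum()
    # Expected neighbors per point under uniformity:
    # n * (volume of the eps-ball / volume of the bounding box).
    ball = pi ** (k / 2) / gamma(k / 2 + 1) * eps ** k
    box = float(np.prod(proj.max(axis=0) - proj.min(axis=0)))
    expected = core.sum() * n * ball / box
    return observed / expected

rng = np.random.default_rng(2)
clustered = np.column_stack([0.5 + 0.02 * rng.normal(size=(300, 2)),
                             rng.uniform(size=(300, 1))])
noise = rng.uniform(size=(100, 3))
data = np.vstack([clustered, noise])
# The subspace {0, 1} containing a cluster scores higher than the
# uniformly distributed dimension {2}.
assert ris_quality(data, [0, 1]) > ris_quality(data, [2])
```

A score near one indicates a roughly uniform subspace, while values well above one indicate the density concentrations that make a subspace worth clustering.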