Multiview clustering [19, 22, 34, 46] approaches the problem from the opposite direction. It is based on the notion of semantically different subspaces, i.e., multiple representations for the same data. Since we cannot generally assume to know the different semantics of subspaces beforehand, results may accordingly be found in overlapping subspaces. As a consequence, these approaches allow some redundancy between the resulting clusters: a certain partial overlap between concepts is tolerated in order not to exclude possibly interesting concepts.
A related though distinct way of addressing the problem of redundancy and distinctiveness of different clusters is to seek diverse clusterings by directly assessing a certain notion of distance between different partitions (so-called alternative clustering approaches [14, 23, 24, 25, 32, 33, 35, 70, 72]). Starting with one clustering solution, these methods search for an alternative clustering solution that provides substantially different insights. Still, alternative clustering solutions need not be completely orthogonal to existing ones but are allowed to show some redundancy with them.
Apparently, avoiding redundancy, as the more 'enhanced' [74] subspace clustering algorithms try to do, should not be pursued as an absolute goal. Multiview clustering and alternative clustering come from the other extreme and relax the original restriction of 'no redundancy' more and more. Relationships between subspace clustering and other families of clustering approaches have been discussed by Zimek and Vreeken [84].
A question related to the redundancy issue is that of the appropriate density level. Both of these issues have a decisive influence on the clusters that are selected. Determining the right density level is a general problem also in full-space density-based clustering [51], but for clustering in subspaces, the problem is even more severe. Setting a fixed density threshold for an Apriori-style subspace search is not appropriate for all possible subspaces. Consider, for example, any CLIQUE-style grid approach: the volume of a hypercube increases exponentially with the dimensionality, hence the density decreases rapidly. As a consequence, any chosen threshold introduces a bias toward identifying clusters of (up to) a certain dimensionality. This observation motivates research on adaptive density thresholds [10, 62]. The algorithmic challenge then comes from the loss of the monotonicity property that would otherwise allow efficient traversal of the search space of subspaces.
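The bias of a fixed density threshold in a grid-based search can be illustrated with a small calculation (a sketch only; the grid resolution xi, data size n, and threshold tau are hypothetical values chosen for illustration, not taken from any of the cited algorithms):

```python
# Sketch: expected cell density in a CLIQUE-style grid under a uniform
# distribution. With xi bins per dimension, a d-dimensional subspace
# has xi**d cells, so the expected number of points per cell shrinks
# by a factor of xi with every added dimension.
def expected_points_per_cell(n, xi, d):
    return n / (xi ** d)

n, xi, tau = 1_000_000, 10, 5  # hypothetical data size, resolution, threshold
for d in range(1, 9):
    dens = expected_points_per_cell(n, xi, d)
    print(f"d={d}: {dens:g} points per cell on average")
```

With these hypothetical settings, the expectation already falls below the threshold tau = 5 for d >= 6, so a fixed threshold systematically favors low-dimensional subspaces, which is exactly the bias that adaptive density thresholds [10, 62] aim to remove.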
When using the Euclidean distance (L2), the appropriate choice of an ε-range becomes extremely challenging as well, due to the rather counter-intuitive behavior of the volume of the hypersphere with increasing dimensions. Let us note that, for outlier detection, the very same problem occurs in high-dimensional data, as has been discussed in detail by Zimek et al. [85]. Choosing the size of the neighborhood in terms of objects rather than in terms of a radius (i.e., using the k nearest neighbors instead of an ε-range query) has been advocated as a workaround for this problem [2], as it solves at least certain aspects, such as guaranteeing a well-defined (non-empty) set of objects for the density estimation, or controlling spatial properties of the neighborhood.
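The counter-intuitive behavior of the hypersphere volume can be checked directly from the standard closed-form expression V_d(ε) = π^(d/2) / Γ(d/2 + 1) · ε^d (a minimal sketch; the fixed radius ε = 1 is an arbitrary choice for illustration):

```python
import math

# Volume of a d-dimensional ball with radius eps:
# V_d(eps) = pi**(d/2) / Gamma(d/2 + 1) * eps**d
def ball_volume(d, eps):
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * eps ** d

# For a fixed radius, the volume first grows with d, peaks around d = 5,
# and then vanishes: an eps-range that is generous in 2 dimensions
# encloses almost no volume (and thus, typically, no objects) in 50.
for d in (1, 2, 5, 10, 20, 50):
    print(f"d={d}: V={ball_volume(d, 1.0):.3e}")
```

A k-nearest-neighbor query sidesteps the radius choice entirely: its neighborhood always contains exactly k objects regardless of the dimensionality, which is precisely the property exploited by the workaround discussed above [2].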