CLIQUE partitions every dimension into intervals, and identifies intervals containing
at least l points, where l is the density threshold. CLIQUE then iteratively joins two
k-dimensional dense cells, c1 and c2, in subspaces (D_{i_1}, ..., D_{i_k}) and
(D_{j_1}, ..., D_{j_k}), respectively, if D_{i_1} = D_{j_1}, ..., D_{i_{k-1}} = D_{j_{k-1}},
and c1 and c2 share the same intervals in those dimensions. The join operation
generates a new (k+1)-dimensional candidate cell c in space
(D_{i_1}, ..., D_{i_{k-1}}, D_{i_k}, D_{j_k}). CLIQUE checks whether the number of points
in c passes the density threshold. The iteration terminates when no candidates can be
generated or no candidate cells are dense.
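To make the candidate-generation step concrete, here is a minimal Python sketch of the join and the density check. The cell representation, the function names (join_dense_cells, is_dense), and the bins structure are illustrative assumptions, not CLIQUE's actual implementation.

from itertools import combinations

def join_dense_cells(dense_cells):
    # One round of candidate generation (a sketch).
    # A cell is a tuple of (dimension, interval) pairs sorted by dimension,
    # e.g. ((0, 3), (2, 1)) occupies interval 3 of dimension 0 and
    # interval 1 of dimension 2. Two k-dimensional dense cells are joined
    # when their first k-1 (dimension, interval) pairs agree, producing a
    # (k+1)-dimensional candidate cell.
    candidates = set()
    for c1, c2 in combinations(sorted(dense_cells), 2):
        if c1[:-1] == c2[:-1] and c1[-1][0] < c2[-1][0]:
            candidates.add(c1 + (c2[-1],))
    return candidates

def is_dense(cell, points, bins, threshold):
    # Count the points falling inside every interval of the cell (a sketch).
    # bins[d] is assumed to hold the interval boundaries of dimension d,
    # so interval i of dimension d covers [bins[d][i], bins[d][i + 1]).
    count = 0
    for p in points:
        if all(bins[d][i] <= p[d] < bins[d][i + 1] for d, i in cell):
            count += 1
    return count >= threshold

Starting from the one-dimensional dense intervals, repeatedly applying join_dense_cells and keeping only the candidates that pass is_dense reproduces the iteration described above.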
In the second step, CLIQUE uses the dense cells in each subspace to assemble clusters,
which can be of arbitrary shape. The idea is to apply the Minimum Description Length
(MDL) principle (Chapter 8), using maximal regions to cover connected dense cells,
where a maximal region is a hyperrectangle in which every cell is dense, and which
cannot be extended further in any dimension of the subspace.
Finding the best description of a cluster is NP-hard in general. Thus, CLIQUE adopts
a simple greedy approach. It starts with an arbitrary dense cell, finds a maximal region
covering the cell, and then works on the remaining dense cells that have not yet been
covered. The greedy method terminates when all dense cells are covered.
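The Python sketch below illustrates this greedy covering under simplifying assumptions: the dense cells of one connected component in a single subspace are given as tuples of (dimension, interval) pairs, and a region is grown one interval at a time in each direction as long as every cell it contains is dense. The names greedy_cover and cells_in are hypothetical, and the growth strategy is a simplification of CLIQUE's actual MDL-based covering.

from itertools import product

def greedy_cover(dense_cells):
    # Greedily cover the dense cells of one subspace with maximal regions (a sketch).
    # A region maps each dimension to an inclusive (lo, hi) interval range.
    dense = {tuple(sorted(c)) for c in dense_cells}
    dims = sorted({d for cell in dense for d, _ in cell})

    def cells_in(region):
        # All grid cells contained in the hyperrectangular region.
        ranges = [range(region[d][0], region[d][1] + 1) for d in dims]
        return [tuple(zip(dims, idx)) for idx in product(*ranges)]

    uncovered = set(dense)
    regions = []
    while uncovered:
        seed = dict(next(iter(uncovered)))           # arbitrary uncovered dense cell
        region = {d: (seed[d], seed[d]) for d in dims}
        grown = True
        while grown:                                  # extend while every enclosed cell stays dense
            grown = False
            for d in dims:
                for step in (-1, 1):
                    lo, hi = region[d]
                    trial = dict(region)
                    trial[d] = (lo + step, hi) if step < 0 else (lo, hi + step)
                    if trial[d][0] >= 0 and all(c in dense for c in cells_in(trial)):
                        region = trial
                        grown = True
        regions.append(region)
        uncovered -= set(cells_in(region))
    return regions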
“How effective is CLIQUE?” CLIQUE automatically finds subspaces of the highest
dimensionality such that high-density clusters exist in those subspaces. It is insensitive
to the order of input objects and does not presume any canonical data distribution. It
scales linearly with the size of the input and has good scalability as the number of dimen-
sions in the data is increased. However, obtaining a meaningful clustering depends on
properly tuning the grid size (the grid is a fixed structure here) and the density
threshold. This can be difficult in practice because the same grid size and density
threshold are used across all combinations of dimensions in the data set. Thus, clustering
accuracy may suffer; this is the price paid for the method's simplicity. Moreover,
for a given dense region, all projections of the region onto lower-dimensionality sub-
spaces will also be dense. This can result in a large overlap among the reported dense
regions. Furthermore, it is difficult to find clusters of rather different densities within
different dimensional subspaces.
Several extensions to this approach follow a similar philosophy. For example, we can
think of a grid as a set of fixed bins. Instead of using fixed bins for each of the dimensions,
we can use an adaptive, data-driven strategy to dynamically determine the bins for each
dimension based on data distribution statistics. Alternatively, instead of using a den-
sity threshold, we may use entropy (Chapter 8) as a measure of the quality of subspace
clusters.
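As an illustration of these two extensions, the sketch below uses quantile-based (equal-frequency) bins in place of fixed equal-width intervals, and Shannon entropy over cell occupancies as a subspace quality score. The function names adaptive_bins and subspace_entropy are illustrative; they are not taken from any specific CLIQUE extension.

import numpy as np

def adaptive_bins(values, n_bins):
    # Data-driven bin boundaries for one dimension (a sketch).
    # Boundaries are placed at quantiles of the observed values, so each
    # bin holds roughly the same number of points.
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(values, qs)
    return np.unique(edges)   # drop duplicate edges in heavily skewed data

def subspace_entropy(points, dims, edges_per_dim):
    # Entropy of the cell-occupancy distribution in a subspace (a sketch).
    # edges_per_dim[d] is assumed to hold the bin edges of dimension d.
    # Lower entropy means points concentrate in few cells, taken as
    # evidence of clustering structure in that subspace.
    points = np.asarray(points)
    cell_ids = [
        tuple(np.searchsorted(edges_per_dim[d], p[d], side="right") for d in dims)
        for p in points
    ]
    _, counts = np.unique(cell_ids, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())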
10.6 Evaluation of Clustering
By now you have learned what clustering is and know several popular clustering meth-
ods. You may ask, “When I try out a clustering method on a data set, how can I
evaluate whether the clustering results are good?” In general, cluster evaluation assesses