CLIQUE partitions every dimension into intervals, and identifies intervals containing
at least l points, where l is the density threshold. CLIQUE then iteratively joins two
k-dimensional dense cells, c1 and c2, in subspaces (D_{i_1}, ..., D_{i_k}) and
(D_{j_1}, ..., D_{j_k}), respectively, if D_{i_1} = D_{j_1}, ..., D_{i_{k-1}} = D_{j_{k-1}},
and c1 and c2 share the same intervals in those dimensions. The join operation
generates a new (k+1)-dimensional candidate cell c in space
(D_{i_1}, ..., D_{i_{k-1}}, D_{i_k}, D_{j_k}). CLIQUE checks whether the number of points
in c passes the density threshold. The iteration terminates when no candidates can be
generated or no candidate cells are dense.
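To make the candidate-generation step concrete, here is a minimal Python sketch of the join and the density check. The cell representation, the function names (join_dense_cells, is_dense), and the bins structure are illustrative assumptions, not CLIQUE's actual implementation.

from itertools import combinations

def join_dense_cells(dense_cells):
    # One round of candidate generation (a sketch).
    # A cell is a tuple of (dimension, interval) pairs sorted by dimension,
    # e.g. ((0, 3), (2, 1)) occupies interval 3 of dimension 0 and
    # interval 1 of dimension 2. Two k-dimensional dense cells are joined
    # when their first k-1 (dimension, interval) pairs agree, producing a
    # (k+1)-dimensional candidate cell.
    candidates = set()
    for c1, c2 in combinations(sorted(dense_cells), 2):
        if c1[:-1] == c2[:-1] and c1[-1][0] < c2[-1][0]:
            candidates.add(c1 + (c2[-1],))
    return candidates

def is_dense(cell, points, bins, threshold):
    # Count the points falling inside every interval of the cell (a sketch).
    # bins[d] is assumed to hold the interval boundaries of dimension d,
    # so interval i of dimension d covers [bins[d][i], bins[d][i + 1]).
    count = 0
    for p in points:
        if all(bins[d][i] <= p[d] < bins[d][i + 1] for d, i in cell):
            count += 1
    return count >= threshold

Starting from the one-dimensional dense intervals, repeatedly applying join_dense_cells and keeping only the candidates that pass is_dense reproduces the iteration described above.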
In the second step, CLIQUE uses the dense cells in each subspace to assemble clusters,
which can be of arbitrary shape. The idea is to apply the Minimum Description Length
(MDL) principle (Chapter 8), using maximal regions to cover connected dense cells,
where a maximal region is a hyperrectangle in which every cell is dense, and which
cannot be extended further in any dimension of the subspace.
Finding the best description of a cluster is NP-hard in general. Thus, CLIQUE adopts
a simple greedy approach. It starts with an arbitrary dense cell, finds a maximal region
covering the cell, and then works on the remaining dense cells that have not yet been
covered. The greedy method terminates when all dense cells are covered.
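The Python sketch below illustrates this greedy covering under simplifying assumptions: the dense cells of one connected component in a single subspace are given as tuples of (dimension, interval) pairs, and a region is grown one interval at a time in each direction as long as every cell it contains is dense. The names greedy_cover and cells_in are hypothetical, and the growth strategy is a simplification of CLIQUE's actual MDL-based covering.

from itertools import product

def greedy_cover(dense_cells):
    # Greedily cover the dense cells of one subspace with maximal regions (a sketch).
    # A region maps each dimension to an inclusive (lo, hi) interval range.
    dense = {tuple(sorted(c)) for c in dense_cells}
    dims = sorted({d for cell in dense for d, _ in cell})

    def cells_in(region):
        # All grid cells contained in the hyperrectangular region.
        ranges = [range(region[d][0], region[d][1] + 1) for d in dims]
        return [tuple(zip(dims, idx)) for idx in product(*ranges)]

    uncovered = set(dense)
    regions = []
    while uncovered:
        seed = dict(next(iter(uncovered)))           # arbitrary uncovered dense cell
        region = {d: (seed[d], seed[d]) for d in dims}
        grown = True
        while grown:                                  # extend while every enclosed cell stays dense
            grown = False
            for d in dims:
                for step in (-1, 1):
                    lo, hi = region[d]
                    trial = dict(region)
                    trial[d] = (lo + step, hi) if step < 0 else (lo, hi + step)
                    if trial[d][0] >= 0 and all(c in dense for c in cells_in(trial)):
                        region = trial
                        grown = True
        regions.append(region)
        uncovered -= set(cells_in(region))
    return regions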
“How effective is CLIQUE?” CLIQUE automatically finds subspaces of the highest
dimensionality such that high-density clusters exist in those subspaces. It is insensitive
to the order of input objects and does not presume any canonical data distribution. It
scales linearly with the size of the input and has good scalability as the number of dimen-
sions in the data is increased. However, obtaining a meaningful clustering depends on
properly tuning the grid size (the grid is a fixed structure here) and the density
threshold. This can be difficult in practice because the same grid size and density
threshold are used across all combinations of dimensions in the data set. Thus, clustering
accuracy may suffer; this is the price paid for the method's simplicity. Moreover,
for a given dense region, all projections of the region onto lower-dimensionality sub-
spaces will also be dense. This can result in a large overlap among the reported dense
regions. Furthermore, it is difficult to find clusters of rather different densities within
different dimensional subspaces.
Several extensions to this approach follow a similar philosophy. For example, we can
think of a grid as a set of fixed bins. Instead of using fixed bins for each of the dimensions,
we can use an adaptive, data-driven strategy to dynamically determine the bins for each
dimension based on data distribution statistics. Alternatively, instead of using a den-
sity threshold, we may use entropy (Chapter 8) as a measure of the quality of subspace
clusters.
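As an illustration of these two extensions, the sketch below uses quantile-based (equal-frequency) bins in place of fixed equal-width intervals, and Shannon entropy over cell occupancies as a subspace quality score. The function names adaptive_bins and subspace_entropy are illustrative; they are not taken from any specific CLIQUE extension.

import numpy as np

def adaptive_bins(values, n_bins):
    # Data-driven bin boundaries for one dimension (a sketch).
    # Boundaries are placed at quantiles of the observed values, so each
    # bin holds roughly the same number of points.
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(values, qs)
    return np.unique(edges)   # drop duplicate edges in heavily skewed data

def subspace_entropy(points, dims, edges_per_dim):
    # Entropy of the cell-occupancy distribution in a subspace (a sketch).
    # edges_per_dim[d] is assumed to hold the bin edges of dimension d.
    # Lower entropy means points concentrate in few cells, taken as
    # evidence of clustering structure in that subspace.
    points = np.asarray(points)
    cell_ids = [
        tuple(np.searchsorted(edges_per_dim[d], p[d], side="right") for d in dims)
        for p in points
    ]
    _, counts = np.unique(cell_ids, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())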
10.6 Evaluation of Clustering
By now you have learned what clustering is and know several popular clustering meth-
ods. You may ask, “When I try out a clustering method on a data set, how can I
evaluate whether the clustering results are good?” In general, cluster evaluation assesses