Cluster Analysis: Basic Concepts and Methods - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

is O

, where g is the total number of grid cells at the lowest level, which is usually much

smaller than n .

Because STING uses a multiresolution approach to cluster analysis, the quality of

STING clustering depends on the granularity of the lowest level of the grid structure. If

the granularity is very fine, the cost of processing will increase substantially; however, if

the bottom level of the grid structure is too coarse, it may reduce the quality of cluster

analysis. Moreover, STING does not consider the spatial relationship between the child-

ren and their neighboring cells for construction of a parent cell. As a result, the shapes

of the resulting clusters are isothetic, that is, all the cluster boundaries are either hori-

zontal or vertical, and no diagonal boundary is detected. This may lower the quality and

accuracy of the clusters despite the fast processing time of the technique.

.

g

/

10.5.2 CLIQUE:AnApriori-likeSubspaceClusteringMethod

A data object often has tens of attributes, many of which may be irrelevant. The val-

ues of attributes may vary considerably. These factors can make it difficult to locate

clusters that span the entire data space. It may be more meaningful to instead search

for clusters within different subspaces of the data. For example, consider a health-

informatics application where patient records contain extensive attributes describing

personal information, numerous symptoms, conditions, and family history.

Finding a nontrivial group of patients for which all or even most of the attributes

strongly agree is unlikely. In bird flu patients, for instance, the age , gender , and job

attributes may vary dramatically within a wide range of values. Thus, it can be difficult

to find such a cluster within the entire data space. Instead, by searching in subspaces, we

may find a cluster of similar patients in a lower-dimensional space (e.g., patients who

are similar to one other with respect to symptoms like high fever, cough but no runny

nose, and aged between 3 and 16).

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-

based clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping

intervals, thereby partitioning the entire embedding space of the data objects into cells.

It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the

number of objects mapped to it exceeds the density threshold.

The main strategy behind CLIQUE for identifying a candidate search space uses the

monotonicity of dense cells with respect to dimensionality. This is based on the Apriori

property used in frequent pattern and association rule mining (Chapter 6). In the con-

text of clusters in subspaces, the monotonicity says the following. A k -dimensional cell c

.

k

>

1

/

can have at least l points only if every

.

k 1

/

-dimensional projection of c , which

is a cell in a

-dimensional subspace, has at least l points. Consider Figure 10.20,

where the embedding data space contains three dimensions: age, salary , and vacation .

A 2-D cell, say in the subspace formed by age and salary , contains l points only if the

projection of this cell in every dimension, that is, age and salary , respectively, contains

at least l points.

CLIQUE performs clustering in two steps. In the first step, CLIQUE partitions

the d -dimensional data space into nonoverlapping rectangular units, identifying the

dense units among these. CLIQUE finds dense cells in all of the subspaces. To do so,

.

k 1

/

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home