Databases Reference
In-Depth Information
is O
, where g is the total number of grid cells at the lowest level, which is usually much
smaller than n .
Because STING uses a multiresolution approach to cluster analysis, the quality of
STING clustering depends on the granularity of the lowest level of the grid structure. If
the granularity is very fine, the cost of processing will increase substantially; however, if
the bottom level of the grid structure is too coarse, it may reduce the quality of cluster
analysis. Moreover, STING does not consider the spatial relationship between the child-
ren and their neighboring cells for construction of a parent cell. As a result, the shapes
of the resulting clusters are isothetic, that is, all the cluster boundaries are either hori-
zontal or vertical, and no diagonal boundary is detected. This may lower the quality and
accuracy of the clusters despite the fast processing time of the technique.
.
g
/
10.5.2 CLIQUE:AnApriori-likeSubspaceClusteringMethod
A data object often has tens of attributes, many of which may be irrelevant. The val-
ues of attributes may vary considerably. These factors can make it difficult to locate
clusters that span the entire data space. It may be more meaningful to instead search
for clusters within different subspaces of the data. For example, consider a health-
informatics application where patient records contain extensive attributes describing
personal information, numerous symptoms, conditions, and family history.
Finding a nontrivial group of patients for which all or even most of the attributes
strongly agree is unlikely. In bird flu patients, for instance, the age , gender , and job
attributes may vary dramatically within a wide range of values. Thus, it can be difficult
to find such a cluster within the entire data space. Instead, by searching in subspaces, we
may find a cluster of similar patients in a lower-dimensional space (e.g., patients who
are similar to one other with respect to symptoms like high fever, cough but no runny
nose, and aged between 3 and 16).
CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-
based clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping
intervals, thereby partitioning the entire embedding space of the data objects into cells.
It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the
number of objects mapped to it exceeds the density threshold.
The main strategy behind CLIQUE for identifying a candidate search space uses the
monotonicity of dense cells with respect to dimensionality. This is based on the Apriori
property used in frequent pattern and association rule mining (Chapter 6). In the con-
text of clusters in subspaces, the monotonicity says the following. A k -dimensional cell c
.
k
>
1
/
can have at least l points only if every
.
k 1
/
-dimensional projection of c , which
is a cell in a
-dimensional subspace, has at least l points. Consider Figure 10.20,
where the embedding data space contains three dimensions: age, salary , and vacation .
A 2-D cell, say in the subspace formed by age and salary , contains l points only if the
projection of this cell in every dimension, that is, age and salary , respectively, contains
at least l points.
CLIQUE performs clustering in two steps. In the first step, CLIQUE partitions
the d -dimensional data space into nonoverlapping rectangular units, identifying the
dense units among these. CLIQUE finds dense cells in all of the subspaces. To do so,
.
k 1
/
 
Search WWH ::




Custom Search