R = \sqrt{\frac{\sum_{i=1}^{n}(x_i - x_0)^2}{n}} = \sqrt{\frac{nSS - LS^2}{n^2}},        (10.9)

D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2nSS - 2LS^2}{n(n-1)}}.        (10.10)
Here, R is the average distance from member objects to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.
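As a quick illustration (not part of the original text), the formulas above can be evaluated directly from the three CF components. The Python sketch below uses hypothetical helper names radius() and diameter(), and assumes LS and SS are kept per dimension, as in Example 10.5 later in this section, so that LS^2 is the squared norm of LS and SS is summed over the dimensions.

```python
import math

# Minimal sketch, assuming LS and SS are per-dimension tuples as in Example 10.5.
# radius() and diameter() are hypothetical helper names, not a BIRCH API.

def radius(n, LS, SS):
    """Average distance from member objects to the centroid, Eq. (10.9)."""
    ls_sq = sum(v * v for v in LS)   # squared norm of the linear sum, LS^2
    ss = sum(SS)                     # total square sum over all dimensions
    return math.sqrt((n * ss - ls_sq) / (n * n))

def diameter(n, LS, SS):
    """Average pairwise distance within the cluster, Eq. (10.10)."""
    ls_sq = sum(v * v for v in LS)
    ss = sum(SS)
    return math.sqrt((2 * n * ss - 2 * ls_sq) / (n * (n - 1)))

# For the cluster of Example 10.5, with CF = <3, (9, 10), (29, 38)>:
# radius(3, (9, 10), (29, 38))   ~= 1.49
# diameter(3, (9, 10), (29, 38)) ~= 2.58
```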
Summarizing a cluster by its clustering feature avoids storing detailed information about individual objects or points. Instead, only a constant amount of space is needed to store the clustering feature. This is the key to BIRCH's space efficiency.
Moreover, clustering features are additive. That is, for two disjoint clusters, C_1 and C_2, with clustering features CF_1 = ⟨n_1, LS_1, SS_1⟩ and CF_2 = ⟨n_2, LS_2, SS_2⟩, respectively, the clustering feature for the cluster formed by merging C_1 and C_2 is simply

CF_1 + CF_2 = ⟨n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2⟩.        (10.11)
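To make the additivity concrete, here is a minimal Python sketch (an illustration, not BIRCH's actual implementation) of a CF triple whose + operator implements Eq. (10.11); the class name CF and its fields are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CF:
    """Clustering feature <n, LS, SS>, with LS and SS kept per dimension."""
    n: int
    LS: Tuple[float, ...]
    SS: Tuple[float, ...]

    def __add__(self, other: "CF") -> "CF":
        # Additivity (Eq. 10.11): merging two disjoint clusters simply
        # adds the corresponding components of their clustering features.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.LS, other.LS)),
            tuple(a + b for a, b in zip(self.SS, other.SS)),
        )
```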
Example 10.5 Clustering feature. Suppose there are three points, (2, 5), (3, 2), and (4, 3), in a cluster, C_1. The clustering feature of C_1 is

CF_1 = ⟨3, (2 + 3 + 4, 5 + 2 + 3), (2^2 + 3^2 + 4^2, 5^2 + 2^2 + 3^2)⟩ = ⟨3, (9, 10), (29, 38)⟩.

Suppose that C_1 is disjoint from a second cluster, C_2, where CF_2 = ⟨3, (35, 36), (417, 440)⟩. The clustering feature of a new cluster, C_3, that is formed by merging C_1 and C_2, is derived by adding CF_1 and CF_2. That is,

CF_3 = ⟨3 + 3, (9 + 35, 10 + 36), (29 + 417, 38 + 440)⟩ = ⟨6, (44, 46), (446, 478)⟩.
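Continuing the CF sketch above, the merge in Example 10.5 can be reproduced directly:

```python
cf1 = CF(3, (9, 10), (29, 38))      # points (2, 5), (3, 2), and (4, 3)
cf2 = CF(3, (35, 36), (417, 440))
cf3 = cf1 + cf2
print(cf3)                          # CF(n=6, LS=(44, 46), SS=(446, 478))
```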
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. An example is shown in Figure 10.9. By definition, a nonleaf node in a tree has descendants or “children.” The nonleaf nodes store sums of the CFs of their children, and thus summarize clustering information about their children. A CF-tree has two parameters: branching factor, B, and threshold, T. The branching factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies the maximum diameter of subclusters stored at the leaf nodes of the tree. These two parameters implicitly control the resulting tree's size.
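The following simplified sketch (an illustration under stated assumptions, not the full BIRCH insertion algorithm) shows how B and T constrain a node: a new point is absorbed into its closest subcluster only if the merged subcluster's diameter stays within T, and a node that already holds B entries signals that it must be split. It reuses the CF class and diameter() helper sketched earlier, and for simplicity treats B as a cap on the number of entries in the node.

```python
import math

def centroid_distance(a: CF, b: CF) -> float:
    """Euclidean distance between the centroids LS/n of two subclusters."""
    ca = [v / a.n for v in a.LS]
    cb = [v / b.n for v in b.LS]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

class LeafNode:
    """Hypothetical leaf node holding one CF entry per subcluster."""

    def __init__(self, B: int, T: float):
        self.B = B            # branching factor: maximum entries kept here
        self.T = T            # threshold: maximum subcluster diameter
        self.entries = []     # list of CF triples, one per subcluster

    def insert_point(self, point) -> bool:
        """Absorb the point if possible; return False if a split is needed."""
        new_cf = CF(1, tuple(point), tuple(v * v for v in point))
        if self.entries:
            # Find the closest existing subcluster by centroid distance.
            idx = min(range(len(self.entries)),
                      key=lambda i: centroid_distance(self.entries[i], new_cf))
            merged = self.entries[idx] + new_cf
            # Absorb only if the merged subcluster still satisfies T.
            if diameter(merged.n, merged.LS, merged.SS) <= self.T:
                self.entries[idx] = merged
                return True
        # Otherwise start a new subcluster, if this node still has room.
        if len(self.entries) < self.B:
            self.entries.append(new_cf)
            return True
        return False   # caller would split the node and update ancestors
```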
Given a limited amount of main memory, an important consideration in BIRCH
is to minimize the time required for input/output (I/O). BIRCH applies a multiphase
clustering technique: A single scan of the data set yields a basic, good clustering, and one or more additional scans can optionally be used to further improve the quality.
 