Databases Reference
In-Depth Information
1. for 1
≤ i ≤ N
2.
C
i
=
{v
i
}
3.
f
(
i
)=
true
4. for 1
≤ i ≤ N
5. for
i ≤ j ≤ N
6.
S
i,j
=
sim
AL
(
v
i
,v
j
)
7. for 1
≤ i ≤ N
8. choose most similar pair
{C
a
,C
b
}
with
f
(
a
)
∧ f
(
b
)true
9.
C
n
+
i
=
C
a
∪ C
b
, left
(
C
n
+
i
)=
C
a
, right
(
C
n
+
i
)=
C
b
10.
f
(
n
+
i
)=
true, f
(
a
)=
false, f
(
b
)=
false
11. for 1
≤ k ≤ N
+
i −
1
12.
S
k,n
+
i
=
sim
AL
(
C
k
,C
n
+
i
)
Fig. 4.
HAC clustering algorithm
The algorithm is simple: place each query into its own single-element clus-
ter, then recursively select the two most similar clusters, join them together
into a new cluster, and update the similarity matrix with the new cluster.
The clusters are treated as the nodes in a binary tree, where the children of
a cluster are the two clusters that were merged to form it.
There are a number of different mechanisms for computing the similarity
between two clusters; HAC + P + FSR uses the average linkage inter-cluster
similarity, which is defined in (2):
1
sim
AL
(
C
i
,C
j
)=
sim
(
v
a
,v
b
)
(2)
|
C
i
||
C
j
|
v
a
∈C
i
v
b
∈C
j
This similarity measure, also referred to in the literature as UPGMA, has
been experimentally shown by Steinbach et al. to be the best choice for HAC
clustering [12].
This calculation requires the computation of the similarity between two
feature vectors; HAC+P+FSR uses the cosine distance metric, defined in (3):
⎧
⎨
0
,
a
=
0
or
b
=
0
t
j
∈T
v
a,j
,
v
b,j
t
j
∈T
v
a,j
sim
(
v
a
,v
b
)=
(3)
,otherwise
t
j
∈T
v
b,j
⎩
The values of this metric are in the range [0
,
1], where 0 indicates no simi-
larity and 1 indicates an exact match. This distance measure has been exper-
imentally determined by Steinbach et al. [12] to be well suited for measuring
the similarity between text document feature vectors.
Note that, as suggested by Dhillon et al. [13], and by Jain et al. [8], it is only
necessary to evaluate the cosine similarity between each pair of documents
once; these can be stored in an upper triangular matrix and used for all
successive computations of the average link similarity.