Databases Reference
In-Depth Information
1. for 1 ≤ i ≤ N
2. C i = {v i }
3. f ( i )= true
4. for 1 ≤ i ≤ N
5. for i ≤ j ≤ N
6. S i,j = sim AL ( v i ,v j )
7. for 1 ≤ i ≤ N
8. choose most similar pair {C a ,C b } with f ( a ) ∧ f ( b )true
9. C n + i = C a ∪ C b , left ( C n + i )= C a , right ( C n + i )= C b
10. f ( n + i )= true, f ( a )= false, f ( b )= false
11. for 1 ≤ k ≤ N + i − 1
12.
S k,n + i = sim AL ( C k ,C n + i )
Fig. 4. HAC clustering algorithm
The algorithm is simple: place each query into its own single-element clus-
ter, then recursively select the two most similar clusters, join them together
into a new cluster, and update the similarity matrix with the new cluster.
The clusters are treated as the nodes in a binary tree, where the children of
a cluster are the two clusters that were merged to form it.
There are a number of different mechanisms for computing the similarity
between two clusters; HAC + P + FSR uses the average linkage inter-cluster
similarity, which is defined in (2):
1
sim AL ( C i ,C j )=
sim ( v a ,v b )
(2)
|
C i ||
C j |
v a ∈C i
v b ∈C j
This similarity measure, also referred to in the literature as UPGMA, has
been experimentally shown by Steinbach et al. to be the best choice for HAC
clustering [12].
This calculation requires the computation of the similarity between two
feature vectors; HAC+P+FSR uses the cosine distance metric, defined in (3):
0 , a = 0 or b = 0
t j ∈T v a,j , v b,j
t j ∈T v a,j
sim ( v a ,v b )=
(3)
,otherwise
t j ∈T v b,j
The values of this metric are in the range [0 , 1], where 0 indicates no simi-
larity and 1 indicates an exact match. This distance measure has been exper-
imentally determined by Steinbach et al. [12] to be well suited for measuring
the similarity between text document feature vectors.
Note that, as suggested by Dhillon et al. [13], and by Jain et al. [8], it is only
necessary to evaluate the cosine similarity between each pair of documents
once; these can be stored in an upper triangular matrix and used for all
successive computations of the average link similarity.
 
Search WWH ::




Custom Search