Generating Concept Hierarchies from User Queries - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

1. for 1 ≤ i ≤ N

2. C i = {v i }

3. f ( i )= true

4. for 1 ≤ i ≤ N

5. for i ≤ j ≤ N

6. S i,j = sim AL ( v i ,v j )

7. for 1 ≤ i ≤ N

8. choose most similar pair {C a ,C b } with f ( a ) ∧ f ( b )true

9. C n + i = C a ∪ C b , left ( C n + i )= C a , right ( C n + i )= C b

10. f ( n + i )= true, f ( a )= false, f ( b )= false

11. for 1 ≤ k ≤ N + i − 1

12.

S k,n + i = sim AL ( C k ,C n + i )

Fig. 4. HAC clustering algorithm

The algorithm is simple: place each query into its own single-element clus-

ter, then recursively select the two most similar clusters, join them together

into a new cluster, and update the similarity matrix with the new cluster.

The clusters are treated as the nodes in a binary tree, where the children of

a cluster are the two clusters that were merged to form it.

There are a number of different mechanisms for computing the similarity

between two clusters; HAC + P + FSR uses the average linkage inter-cluster

similarity, which is defined in (2):

1

sim AL ( C i ,C j )=

sim ( v a ,v b )

(2)

|

C i ||

C j |

v a ∈C i

v b ∈C j

This similarity measure, also referred to in the literature as UPGMA, has

been experimentally shown by Steinbach et al. to be the best choice for HAC

clustering [12].

This calculation requires the computation of the similarity between two

feature vectors; HAC+P+FSR uses the cosine distance metric, defined in (3):

⎧

⎨

0 , a = 0 or b = 0

t j ∈T v a,j , v b,j

t j ∈T v a,j

sim ( v a ,v b )=

(3)

,otherwise

t j ∈T v b,j

⎩

The values of this metric are in the range [0 , 1], where 0 indicates no simi-

larity and 1 indicates an exact match. This distance measure has been exper-

imentally determined by Steinbach et al. [12] to be well suited for measuring

the similarity between text document feature vectors.

Note that, as suggested by Dhillon et al. [13], and by Jain et al. [8], it is only

necessary to evaluate the cosine similarity between each pair of documents

once; these can be stored in an upper triangular matrix and used for all

successive computations of the average link similarity.

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home