Algorithm: HMRF-KMeans
Input: Set of data points X = {x_i}_{i=1}^N, number of clusters K,
set of must-link constraints C_ML = {(x_i, x_j)},
set of cannot-link constraints C_CL = {(x_i, x_j)},
distortion measures {D_h}_{h=1}^K, constraint violation costs W.
Output: Disjoint K-partitioning {X_h}_{h=1}^K of X such that the
objective function J_obj is (locally) minimized.
Method:
1. Initialize the K cluster centroids {μ_h^(0)}_{h=1}^K, set t ← 0
2. Repeat until convergence
  2a. E-step: Given centroids {μ_h^(t)}_{h=1}^K, re-assign cluster labels
      {y_i^(t+1)}_{i=1}^N on the points {x_i}_{i=1}^N to minimize J_obj.
  2b. M-step(A): Given cluster labels {y_i^(t+1)}_{i=1}^N, re-calculate
      cluster centroids {μ_h^(t+1)}_{h=1}^K to minimize J_obj.
  2c. M-step(B): Re-estimate distortion measures {D_h}_{h=1}^K to reduce J_obj.
  2d. t ← t+1
FIGURE 7.10: HMRF-KMeans algorithm.
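The EM loop of Figure 7.10 can be sketched in a few dozen lines. The following is a minimal illustrative version, not Basu et al.'s implementation: it assumes squared-Euclidean distortion, replaces the cost matrix W with a single uniform violation cost w, and omits M-step(B) (the distortion measure is kept fixed rather than re-estimated).

```python
import numpy as np

def hmrf_kmeans(X, K, must_link, cannot_link, w=1.0, max_iter=50, seed=0):
    """Minimal HMRF-KMeans sketch: k-means whose E-step assignment cost
    adds a penalty w for each violated must-link or cannot-link constraint.
    Assumes squared-Euclidean distortion; M-step(B) is omitted."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = X[rng.choice(n, K, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        old = labels.copy()
        # E-step: greedily re-assign each point, charging constraint penalties
        # against the current labels of its constrained partners (ICM-style).
        for i in range(n):
            costs = np.sum((centroids - X[i]) ** 2, axis=1)
            for h in range(K):
                for (a, b) in must_link:
                    j = b if a == i else a if b == i else None
                    if j is not None and labels[j] != h:
                        costs[h] += w  # must-link violated if clusters differ
                for (a, b) in cannot_link:
                    j = b if a == i else a if b == i else None
                    if j is not None and labels[j] == h:
                        costs[h] += w  # cannot-link violated if clusters match
            labels[i] = np.argmin(costs)
        # M-step(A): recompute each centroid as the mean of its points.
        for h in range(K):
            if np.any(labels == h):
                centroids[h] = X[labels == h].mean(axis=0)
        if np.array_equal(labels, old):  # converged: no label changed
            break
    return labels, centroids
```

On four toy points forming two well-separated pairs, a must-link between the first two points and a cannot-link across the pairs steer the assignment toward the intended partition even from a poor initialization.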
7.7 Experiments
This section describes the experiments that were performed to demonstrate
the effectiveness of various types of constrained clustering algorithms on text
data. We take one algorithm of each type described earlier: constraint-based,
distance-based, and hybrid (both). We use the work of Basu and collaborators
(6), which includes one algorithm of each type but keeps the same underlying
implementation across them. This means that any insights we find when comparing
the different algorithms are due to how the constraints are used, rather than,
for example, different initialization schemes. Experiments were run using both
Euclidean distance and cosine distance, since different algorithms outlined in
this chapter used different distance measures.
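When documents are L2-normalized, as is common under the vector space model, the two measures are closely linked: squared Euclidean distance equals twice the cosine distance, so either can drive the same assignment step. A short check of this identity (the vectors here are arbitrary examples):

```python
import numpy as np

# For unit-normalized vectors u and v:
#   ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 (1 - cos(u, v))
u = np.array([3.0, 4.0])
v = np.array([1.0, 1.0])
u /= np.linalg.norm(u)  # normalize to unit length
v /= np.linalg.norm(v)

cos_dist = 1.0 - u @ v                # cosine distance
sq_euclid = np.sum((u - v) ** 2)      # squared Euclidean distance
assert np.isclose(sq_euclid, 2.0 * cos_dist)
```

This is why results under the two measures are comparable once the data are normalized; they differ only when vector lengths carry information.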
7.7.1 Datasets
We considered three text datasets that share the characteristics of being sparse,
high-dimensional, and having a small number of instances relative to the
dimensionality of the space. This was done for two reasons: (1) When cluster-
ing sparse high-dimensional data, e.g., text documents represented using the
vector space model, it is particularly difficult to cluster small datasets, as ob-
served by (20). The purpose of performing experiments on these subsets is to
scale down the sizes of the datasets for computational reasons but at the same