Algorithm: HMRF-KMeans
Input: Set of data points X = {x_i}_{i=1}^N, number of clusters K,
set of must-link constraints C_ML = {(x_i, x_j)},
set of cannot-link constraints C_CL = {(x_i, x_j)},
distortion measures {D_h}_{h=1}^K, constraint violation costs W.
Output: Disjoint K-partitioning {X_h}_{h=1}^K of X such that the
objective function J_obj is (locally) minimized.
Method:
1. Initialize the K cluster centroids {μ_h^(0)}_{h=1}^K, set t ← 0
2. Repeat until convergence
  2a. E-step: Given centroids {μ_h^(t)}_{h=1}^K, re-assign cluster labels
      {y_i^(t+1)}_{i=1}^N on the points {x_i}_{i=1}^N to minimize J_obj.
  2b. M-step(A): Given cluster labels {y_i^(t+1)}_{i=1}^N, re-calculate
      cluster centroids {μ_h^(t+1)}_{h=1}^K to minimize J_obj.
  2c. M-step(B): Re-estimate distortion measures {D_h}_{h=1}^K to reduce J_obj.
  2d. t ← t+1
FIGURE 7.10: HMRF-KMeans algorithm.
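The EM loop of Figure 7.10 can be sketched in a few dozen lines. The following is a minimal illustrative version, not Basu et al.'s implementation: it assumes squared-Euclidean distortion, replaces the cost matrix W with a single uniform violation cost w, and omits M-step(B) (the distortion measure is kept fixed rather than re-estimated).

```python
import numpy as np

def hmrf_kmeans(X, K, must_link, cannot_link, w=1.0, max_iter=50, seed=0):
    """Minimal HMRF-KMeans sketch: k-means whose E-step assignment cost
    adds a penalty w for each violated must-link or cannot-link constraint.
    Assumes squared-Euclidean distortion; M-step(B) is omitted."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = X[rng.choice(n, K, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        old = labels.copy()
        # E-step: greedily re-assign each point, charging constraint penalties
        # against the current labels of its constrained partners (ICM-style).
        for i in range(n):
            costs = np.sum((centroids - X[i]) ** 2, axis=1)
            for h in range(K):
                for (a, b) in must_link:
                    j = b if a == i else a if b == i else None
                    if j is not None and labels[j] != h:
                        costs[h] += w  # must-link violated if clusters differ
                for (a, b) in cannot_link:
                    j = b if a == i else a if b == i else None
                    if j is not None and labels[j] == h:
                        costs[h] += w  # cannot-link violated if clusters match
            labels[i] = np.argmin(costs)
        # M-step(A): recompute each centroid as the mean of its points.
        for h in range(K):
            if np.any(labels == h):
                centroids[h] = X[labels == h].mean(axis=0)
        if np.array_equal(labels, old):  # converged: no label changed
            break
    return labels, centroids
```

On four toy points forming two well-separated pairs, a must-link between the first two points and a cannot-link across the pairs steer the assignment toward the intended partition even from a poor initialization.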
7.7 Experiments
This section describes the experiments that were performed to demonstrate
the effectiveness of various types of constrained clustering algorithms on text
data. We take one algorithm of each type described earlier: constraint-based,
distance-based, and hybrid (both). We use the work of Basu and collaborators
(6), which includes one algorithm of each type but keeps the same underlying
implementation across them. This means that any insights we find when comparing
the different algorithms are due to how the constraints are used, rather than,
for example, different initialization schemes. Experiments were run using both
Euclidean distance and cosine distance, since different algorithms outlined in
this chapter used different distance measures.
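When documents are L2-normalized, as is common under the vector space model, the two measures are closely linked: squared Euclidean distance equals twice the cosine distance, so either can drive the same assignment step. A short check of this identity (the vectors here are arbitrary examples):

```python
import numpy as np

# For unit-normalized vectors u and v:
#   ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 (1 - cos(u, v))
u = np.array([3.0, 4.0])
v = np.array([1.0, 1.0])
u /= np.linalg.norm(u)  # normalize to unit length
v /= np.linalg.norm(v)

cos_dist = 1.0 - u @ v                # cosine distance
sq_euclid = np.sum((u - v) ** 2)      # squared Euclidean distance
assert np.isclose(sq_euclid, 2.0 * cos_dist)
```

This is why results under the two measures are comparable once the data are normalized; they differ only when vector lengths carry information.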
7.7.1 Datasets
We considered three text datasets that share the characteristics of being sparse,
high-dimensional, and having a small number of instances relative to the
dimensionality of the space. This was done for two reasons: (1) When cluster-
ing sparse high-dimensional data, e.g., text documents represented using the
vector space model, it is particularly difficult to cluster small datasets, as ob-
served by (20). The purpose of performing experiments on these subsets is to
scale down the sizes of the datasets for computational reasons but at the same