Database Reference
In-Depth Information
Input: X : A set of data instances to cluster, C = : set of pairwise must-link
constraints, C = : set of pairwise cannot-link constraints, k :thenumberof
clusters to find. Initially, the weight of each instance is 1.
Output: A partition of X, Π k =
into k clusters that is a local
optima of the VQE (Equation 7.2). and all constraints in C = C =
{
π 1 ,...,π k }
C = are
satisfied.
1. Compute the transitive closure of the set C = to obtain the connected
components CC 1 ,...,CC r .
2. For each i ,1
r , replace all the data instances in CC i by a single
instance with weight
i
; the instance's coordinates are obtained by
averaging the coordinates of the instances in CC i .
|
CC i |
3. Randomly generate cluster centroids μ 1 ,..., μ k .
4. loop until convergence do
(a) for i =1 to
|
X
|
do
(a.1) Assign x i to the nearest feasible cluster π j , where near-
ness is measured in terms of distance from x i to centroid μ j .
(a.2) If assignment of x i to any cluster always violates a con-
straint, then exit with failure.
(b) Recalculate centroids μ 1 ,...,μ k taking into account the weight
of the instances in X using Equation 7.5
5. Return final partitions.
FIGURE 7.6 : Clustering under constraints using COP-KMeans.
link constraint between instances. Similarly, if constraints are generated by
domain experts, some constraints may be ill-specified or even contradictory.
The two algorithms in this subsection attempt to ignore noisy or inappropriate
constraints by allowing constraints to be left unsatisfied but with a penalty.
This involves a trade-off between finding the best clustering and satisfying as
many constraints as possible. To achieve this, the penalty of ignoring a con-
straint must be in the same units as the measure for how good the clustering
of the data is. The CVQE algorithm uses distance as the fundamental unit
and the PKM uses probability. We now discuss these two algorithms.
7.4.2.1
CVQE
The core idea behind the CVQE algorithm is to penalize constraint viola-
tions using distance. If a must-link constraint is violated then the penalty
Search WWH ::




Custom Search