Database Reference
In-Depth Information
7.4 Partitional Clustering with Constraints
Some of the very first algorithms that made use of constraints were varia-
tions of the popular KMeans iterative algorithm. The purpose of the KMeans
algorithm is to (locally) minimize the vector quantization error (also known
as the distortion) shown in Equation 7.2.
k
VQE
=
VQE
j
(7.2)
j
=1
2
x
i
∈
1
D
(
μ
j
,x
i
)
2
VQE
j
=
(7.3)
π
j
where
j
indexes over the clusters, and
k
is the number of clusters (specified
as an input parameter), and
D
is the distance function.
The KMeans algorithm is an iterative algorithm which in every step at-
tempts to further minimize the distortion. Given a set of cluster centroids,
the algorithm assigns instances to their
nearest
centroid which of course min-
imizes the distortion. This is step 1 of the algorithm. Step 2 is to recalculate
the cluster centroids so as to minimize the distortion. This can be achieved
by taking the first order derivative of the error (Equation 7.3) with respect to
the
j
th
centroid and setting it to zero and solving. A solution to the resulting
equation gives us the KMeans centroid update rule as shown in Equation 7.5.
d
(
x
i
∈π
j
D
(
μ
j
,x
i
)
2
)
d
(
μ
j
)
d
(
VQE
j
)
d
(
μ
j
)
=
=0
(7.4)
μ
j
=
x
i
∈π
j
⇒
x
i
/
|
π
j
|
(7.5)
Recall that
π
j
is the set of instances closest to the centroid of the
j
th
cluster.
These two steps are used in the standard KMeans algorithm shown in
Figure
7.4
.
7.4.1 COP-KMeans
The COP-KMeans algorithm shown in Figure 7.4.1 can be seen to be a
two part variation of the KMeans algorithm that incorporates conjunctions
of constraints. Firstly, the transitive closure over the must-linked instances
is computed, so that
c
=
(
x, y
)and
c
=
(
y, z
)
c
=
(
x, z
)and
x, y, z
form a con-
nected component. The resultant connected components are replaced by a
⇒
Search WWH ::
Custom Search