Database Reference
In-Depth Information
Input:
A dataset
X
=
{
x
1
,...,x
n
}
to cluster,
k
: the number of clusters to
find.
Output:
A partition of
X,
Π
k
=
{
π
1
,...,π
k
}
into
k
clusters that is a local
optima of the VQE (Equation 7.2).
1. Randomly generate cluster centroids
μ
1
,...,
μ
k
.
2.
loop
until convergence
do
(a)
for
i
=1
to
do
(a.1) Assign
x
i
to the nearest cluster
π
j
, where nearness is
measured in terms of distance from
x
i
to centroid
μ
j
.
(b) Recalculate centroids
μ
1
,...,μ
k
according to Equation 7.5
|
X
|
3. Return the final partitions.
FIGURE 7.5
: Clustering using KMeans.
super-instance, whose co-ordinates are the average of the connected compo-
nent's and whose weight is equal to the number of instances within it (lines 1
and 2). Secondly, rather than performing a nearest centroid assignment (step
2a.1) in
Figure 7.4
), a nearest
feasible
centroid assignment is performed (lines
4a.1), where an assignment is feasible if it does not violate any cannot-link
constraints. When performing the nearest feasible centroid assignment step
the previous set partition is
forgotten
and the new partition built up incre-
mentally. Therefore, the first instance assigned to a cluster can
never
violate
any constraints, even if it is involved in many. Similarly if there is only one
constraint,
c
=
(
x, y
), if
x
is assigned first then
y
is assigned to its closest feasi-
ble centroid and the assignment of
x
is not revisited. In this way, we can view
this algorithm as greedily trying to attempt constructing a feasible clustering
with
no
backtracking of previous instance assignments.
Natural variations of trying to satisfy all constraints are: a) attempting to
satisfy as many constraints as possible while ignoring noisy or inappropriate
constraints and b) having degrees of belief/importance associated with each
constraint. Both can be viewed as frameworks that allow trading of satisfying
the lesser important constraints.
7.4.2 Algorithms with Penalties - PKM, CVQE
The COP-KMeans algorithm can (see
Section 7.4.1
) improve the accuracy
at predicting an extrinsic label and also shape clusters into desirable forms.
However, when constraints are generated from labeled data there is the pos-
sibility of class label noise, thereby generating incorrect cannot-link or must-
Search WWH ::
Custom Search