Given a data set and a set of constraints on instances (i.e., must-link or cannot-link constraints), how can we extend the k-means method to satisfy such constraints? The COP-k-means algorithm works as follows:
1. Generate superinstances for must-link constraints. Compute the transitive closure of the must-link constraints. Here, all must-link constraints are treated as an equivalence relation. The closure gives one or multiple subsets of objects where all objects in a subset must be assigned to one cluster. To represent such a subset, we replace all the objects in the subset by their mean. The superinstance also carries a weight, which is the number of objects it represents.
After this step, the must-link constraints are always satisfied.
2. Conduct modified k-means clustering. Recall that, in k-means, an object is assigned to the closest center. What if a nearest-center assignment violates a cannot-link constraint? To respect cannot-link constraints, we modify the center assignment process in k-means to a nearest feasible center assignment. That is, when the objects are assigned to centers in sequence, at each step we make sure the assignments so far do not violate any cannot-link constraints. An object is assigned to the nearest center such that the assignment respects all cannot-link constraints.
Because COP-k-means ensures at every step that no constraints are violated, it does not require any backtracking. It is a greedy algorithm for generating a clustering that satisfies all constraints, provided that no conflicts exist among the constraints.
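The two steps above can be sketched in code. This is a minimal illustration only, assuming Euclidean distance, random initial centers, and the helper names shown (a union-find transitive closure, weighted means); it is not the original COP-k-means implementation.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def transitive_closure(n, must_links):
    """Union-find over object indices 0..n-1; returns the groups of
    must-linked objects (singletons included)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for a, b in must_links:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def cop_kmeans(points, k, must_links, cannot_links, iters=20):
    # Step 1: replace each must-link group by its mean (a superinstance)
    # that carries the group size as its weight.
    dim = len(points[0])
    groups = transitive_closure(len(points), must_links)
    supers, idx_of = [], {}
    for g in groups:
        mean = [sum(points[i][d] for i in g) / len(g) for d in range(dim)]
        for i in g:
            idx_of[i] = len(supers)
        supers.append((mean, len(g)))
    # Lift cannot-link constraints to the superinstance level.
    cl = {(idx_of[a], idx_of[b]) for a, b in cannot_links}
    cl |= {(b, a) for a, b in cl}
    centers = [list(supers[i][0]) for i in random.sample(range(len(supers)), k)]
    assign = [None] * len(supers)
    for _ in range(iters):
        # Step 2: nearest FEASIBLE center assignment -- skip any center that
        # would violate a cannot-link with an assignment made so far.
        # (Raises StopIteration on conflicting constraints, which the
        # algorithm assumes do not occur.)
        for i, (p, _) in enumerate(supers):
            order = sorted(range(k), key=lambda c: dist(p, centers[c]))
            assign[i] = next(c for c in order
                             if all(assign[j] != c or (i, j) not in cl
                                    for j in range(len(supers)) if j != i))
        # Recompute each center as the weighted mean of its superinstances.
        for c in range(k):
            members = [(p, w) for (p, w), a in zip(supers, assign) if a == c]
            if members:
                tw = sum(w for _, w in members)
                centers[c] = [sum(p[d] * w for p, w in members) / tw
                              for d in range(dim)]
    # Map superinstance labels back to the original objects.
    return [assign[idx_of[i]] for i in range(len(points))], centers
```

Note how the weight travels with each superinstance: when a center is recomputed, a superinstance representing five objects pulls the mean five times as hard as a singleton.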
Handling Soft Constraints
Clustering with soft constraints is an optimization problem. When a clustering violates a
soft constraint, a penalty is imposed on the clustering. Therefore, the optimization goal
of the clustering contains two parts: optimizing the clustering quality and minimizing
the constraint violation penalty. The overall objective function is a combination of the
clustering quality score and the penalty score.
To illustrate, we again use partitioning clustering as an example. Given a data set and a set of soft constraints on instances, the CVQE (Constrained Vector Quantization Error) algorithm conducts k-means clustering while enforcing constraint violation penalties. The objective function used in CVQE is the sum of the distance used in k-means, adjusted by the constraint violation penalties, which are calculated as follows.
Penalty of a must-link violation. If there is a must-link constraint on objects x and y, but they are assigned to two different centers, c1 and c2, respectively, then the constraint is violated. As a result, dist(c1, c2), the distance between c1 and c2, is added to the objective function as the penalty.
Penalty of a cannot-link violation. If there is a cannot-link constraint on objects x and y, but they are assigned to a common center, c, then the constraint is violated.