Given a data set and a set of constraints on instances (i.e., must-link or cannot-link constraints), how can we extend the k-means method to satisfy such constraints? The COP-k-means algorithm works as follows:
1. Generate superinstances for must-link constraints. Compute the transitive closure of the must-link constraints. Here, all must-link constraints are treated as an equivalence relation. The closure gives one or multiple subsets of objects, where all objects in a subset must be assigned to one cluster. To represent such a subset, we replace all the objects in the subset by their mean. The superinstance also carries a weight, which is the number of objects it represents. After this step, the must-link constraints are always satisfied.
2. Conduct modified k-means clustering. Recall that, in k-means, an object is assigned to the closest center. What if a nearest-center assignment violates a cannot-link constraint? To respect cannot-link constraints, we modify the center assignment process in k-means to a nearest feasible center assignment. That is, when the objects are assigned to centers in sequence, at each step we make sure the assignments so far do not violate any cannot-link constraints. An object is assigned to the nearest center so that the assignment respects all cannot-link constraints.
Because COP-k-means ensures that no constraints are violated at every step, it does not require any backtracking. It is a greedy algorithm for generating a clustering that satisfies all constraints, provided that no conflicts exist among the constraints.
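To make the two steps concrete, here is a minimal sketch in Python. It assumes X is an (n, d) NumPy array of objects; the function names, the union-find computation of the transitive closure, and the greedy feasibility check are illustrative choices, not the book's own code. Between assignment passes, the centers would be recomputed as weighted means of their superinstances, just as in ordinary k-means.

```python
import numpy as np
from collections import defaultdict

def must_link_superinstances(X, must_link):
    """Step 1: take the transitive closure of the must-link constraints
    (treated as an equivalence relation) and replace each resulting subset
    of objects by its mean; each superinstance carries the number of
    objects it represents as its weight."""
    n = len(X)
    parent = list(range(n))                      # union-find over object indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)

    members = list(groups.values())              # objects behind each superinstance
    supers = np.array([X[idx].mean(axis=0) for idx in members])
    weights = np.array([len(idx) for idx in members], dtype=float)
    return supers, weights, members

def nearest_feasible_assignment(supers, members, centers, cannot_link):
    """Step 2: assign each superinstance to its nearest feasible center,
    i.e., the closest center whose current members do not violate any
    cannot-link constraint with it; return None if some superinstance has
    no feasible center (the greedy pass fails)."""
    cl = {frozenset(pair) for pair in cannot_link}
    assign = [None] * len(supers)
    for i, s in enumerate(supers):
        for k in np.argsort(np.linalg.norm(centers - s, axis=1)):
            conflict = any(
                assign[j] == k and any(frozenset((p, q)) in cl
                                       for p in members[i] for q in members[j])
                for j in range(len(supers)) if assign[j] is not None)
            if not conflict:
                assign[i] = int(k)
                break
        if assign[i] is None:
            return None                          # constraints conflict under this order
    return assign
```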
Handling Soft Constraints
Clustering with soft constraints is an optimization problem. When a clustering violates a
soft constraint, a penalty is imposed on the clustering. Therefore, the optimization goal
of the clustering contains two parts: optimizing the clustering quality and minimizing
the constraint violation penalty. The overall objective function is a combination of the
clustering quality score and the penalty score.
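As an illustrative sketch (the notation and the simple additive form are our assumptions; specific methods differ in how the two parts are defined and weighted), such a combined objective can be written as

```latex
\operatorname{obj}(\mathcal{C})
  = \underbrace{\sum_{j=1}^{k} \sum_{x \in C_j} \operatorname{dist}(x, c_j)}_{\text{clustering quality}}
  \;+\;
  \underbrace{\sum_{v \in \text{violated constraints}} \operatorname{penalty}(v)}_{\text{violation penalty}}
```

where C_1, ..., C_k are the clusters and c_j is the center of C_j. The CVQE algorithm described next instantiates both terms for partitioning clustering.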
To illustrate, we again use partitioning clustering as an example. Given a data set and a set of soft constraints on instances, the CVQE (Constrained Vector Quantization Error) algorithm conducts k-means clustering while enforcing constraint violation penalties. The objective function used in CVQE is the sum of the distances used in k-means, adjusted by the constraint violation penalties, which are calculated as follows.
Penalty of a must-link violation. If there is a must-link constraint on objects x and y, but they are assigned to two different centers, c1 and c2, respectively, then the constraint is violated. As a result, dist(c1, c2), the distance between c1 and c2, is added to the objective function as the penalty.
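As a minimal sketch of how this penalty enters the objective (the function name is hypothetical, assign maps each object index to its center index, and plain Euclidean distances are used to match the description above; the published CVQE formulation may differ in details such as squaring the distances):

```python
import numpy as np

def cvqe_objective_must_link_part(X, assign, centers, must_link):
    """Illustrative objective: the usual k-means distance term plus, for each
    violated must-link constraint (x, y) assigned to centers c1 != c2, the
    penalty dist(c1, c2).  The cannot-link penalty, discussed next, would be
    added to the objective in the same way."""
    # k-means term: distance of every object to its assigned center
    obj = sum(np.linalg.norm(X[i] - centers[assign[i]]) for i in range(len(X)))
    # must-link violation penalties
    for x, y in must_link:
        c1, c2 = assign[x], assign[y]
        if c1 != c2:                       # must-link constraint violated
            obj += np.linalg.norm(centers[c1] - centers[c2])
    return obj
```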
Penalty of a cannot-link violation. If there is a cannot-link constraint on objects x and y, but they are assigned to a common center, c, then the constraint is violated.