Constrained Partitional Clustering of Text Data: An Overview - Text Mining: Classification, Clustering, and Applications

Database Reference

In-Depth Information

Input: A dataset X =

{

x 1 ,...,x n }

to cluster, k : the number of clusters to

find.

Output: A partition of X, Π k =

{

π 1 ,...,π k }

into k clusters that is a local

optima of the VQE (Equation 7.2).

1. Randomly generate cluster centroids μ 1 ,..., μ k .

2. loop until convergence do

(a) for i =1 to

do

(a.1) Assign x i to the nearest cluster π j , where nearness is

measured in terms of distance from x i to centroid μ j .

(b) Recalculate centroids μ 1 ,...,μ k according to Equation 7.5

|

X

|

3. Return the final partitions.

FIGURE 7.5 : Clustering using KMeans.

super-instance, whose co-ordinates are the average of the connected compo-

nent's and whose weight is equal to the number of instances within it (lines 1

and 2). Secondly, rather than performing a nearest centroid assignment (step

2a.1) in Figure 7.4 ), a nearest feasible centroid assignment is performed (lines

4a.1), where an assignment is feasible if it does not violate any cannot-link

constraints. When performing the nearest feasible centroid assignment step

the previous set partition is forgotten and the new partition built up incre-

mentally. Therefore, the first instance assigned to a cluster can never violate

any constraints, even if it is involved in many. Similarly if there is only one

constraint, c = ( x, y ), if x is assigned first then y is assigned to its closest feasi-

ble centroid and the assignment of x is not revisited. In this way, we can view

this algorithm as greedily trying to attempt constructing a feasible clustering

with no backtracking of previous instance assignments.

Natural variations of trying to satisfy all constraints are: a) attempting to

satisfy as many constraints as possible while ignoring noisy or inappropriate

constraints and b) having degrees of belief/importance associated with each

constraint. Both can be viewed as frameworks that allow trading of satisfying

the lesser important constraints.

7.4.2 Algorithms with Penalties - PKM, CVQE

The COP-KMeans algorithm can (see Section 7.4.1 ) improve the accuracy

at predicting an extrinsic label and also shape clusters into desirable forms.

However, when constraints are generated from labeled data there is the pos-

sibility of class label noise, thereby generating incorrect cannot-link or must-

Text Mining: Classification, Clustering, and Applications

Search WWH ::

Custom Search

Home