\[
h(x_1, x_2) = \max_i \, p\bigl(y(x_1) = i \mid \Theta\bigr) \cdot \max_j \, p\bigl(y(x_2) = j \mid \Theta\bigr)
\tag{7.14}
\]
where y(x_i) is the label assignment for point x_i.
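Eq. (7.14) can be evaluated directly from the per-point cluster posteriors produced by the GMM. A minimal sketch, assuming the posterior vectors for the two points are given (the example values below are illustrative, not from the text):

```python
import numpy as np

def weak_distance(posterior_x1, posterior_x2):
    """Weak 'same-cluster' score of Eq. (7.14): the product of the
    maximal posterior cluster probabilities of the two points."""
    return np.max(posterior_x1) * np.max(posterior_x2)

# Hypothetical GMM posteriors over K = 3 clusters for two points.
p1 = np.array([0.8, 0.1, 0.1])   # x1 is confidently in cluster 0
p2 = np.array([0.1, 0.7, 0.2])   # x2 is confidently in cluster 1
h = weak_distance(p1, p2)        # 0.8 * 0.7 = 0.56
```

Note that the score is high when both points are confidently assigned, regardless of whether they are assigned to the same cluster.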
The DistBoost algorithm computes the weights of the “weak” distance functions using boosting, and updates the weights on pairs of instances, which are then translated to weights on individual data instances. These instance weights are passed back to the input of the GMM-EM algorithm, and the process is repeated for multiple steps.
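The translation from pair weights to instance weights can be sketched as follows. This is a hypothetical simplification of DistBoost's translation step, assuming each point simply accumulates the weight of every constrained pair it participates in (the pair indices and weights below are illustrative):

```python
import numpy as np

def pair_to_instance_weights(pair_weights, n_points):
    """Translate boosting weights on pairs (i, j) into a normalized
    weight distribution over individual instances, by giving each
    point the summed weight of the pairs it belongs to."""
    w = np.zeros(n_points)
    for (i, j), wij in pair_weights.items():
        w[i] += wij
        w[j] += wij
    total = w.sum()
    # Fall back to a uniform distribution if no pair is weighted.
    return w / total if total > 0 else np.full(n_points, 1.0 / n_points)

# Point 1 appears in both pairs, so it receives the largest weight.
weights = pair_to_instance_weights({(0, 1): 0.5, (1, 2): 0.25}, n_points=4)
```

The resulting distribution then serves as the per-instance weighting for the next GMM-EM round.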
7.6 Satisfying Constraints and Learning Distance Functions
As mentioned in Section 7.2, there have been some algorithms that try to
both enforce constraints and learn distance functions from constraints for par-
titional clustering algorithms, e.g., KMeans. In this section we will outline one
such algorithm, which uses the framework of a generative probabilistic model,
the Hidden Markov Random Field (HMRF) (6). It can cluster text docu-
ments using either cosine distance or Euclidean distance on L 2 -normalized
input data, doing both constraint satisfaction and distance learning in the
process.
7.6.1 Hidden Markov Random Field (HMRF) Model
The Hidden Markov Random Field (HMRF) is a probabilistic generative
model for semi-supervised constrained clustering, consisting of the following
components: (1) an observable set X = (x_1, ..., x_n) of random variables, corresponding to the given data instances X; (2) an unobservable (hidden) set Y = (y_1, ..., y_n) of random variables, corresponding to cluster assignments of instances in X, with y_i ∈ {1, ..., K}; (3) an unobservable (hidden) set of generative model parameters Θ, which consists of distance measure parameters A (typically a matrix or vector of weights) and cluster representatives M = (μ_1, ..., μ_K), so that Θ = {A, M}; (4) an observable set of constraint variables C = (c_12, c_13, ..., c_{n-1,n}). Each c_ij is a ternary variable taking a value from the set {-1, 0, 1}, where c_ij = 1 indicates that (x_i, x_j) ∈ C_ML, c_ij = -1 indicates that (x_i, x_j) ∈ C_CL, and c_ij = 0 corresponds to pairs (x_i, x_j) that
are not constrained. The constraints are accompanied by associated violation costs W, where w_ij represents the cost of violating the constraint between instances x_i and x_j, if such a constraint exists. Fig. 7.8 shows a simple example
of an HMRF having five data instances partitioned into three clusters, while
maximally respecting three pairwise constraints.
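The constraint variables C and costs W can be encoded concretely as symmetric matrices, with c[i, j] ∈ {-1, 0, 1} as defined above. A minimal sketch, assuming n = 5 instances as in the figure; the specific pairs, costs, and helper names below are illustrative, not taken from the text:

```python
import numpy as np

n = 5
c = np.zeros((n, n), dtype=int)   # c[i, j]: 1 must-link, -1 cannot-link, 0 none
w = np.zeros((n, n))              # w[i, j]: cost of violating that constraint

def add_constraint(i, j, kind, cost):
    """kind = +1 for a must-link (C_ML), -1 for a cannot-link (C_CL)."""
    c[i, j] = c[j, i] = kind
    w[i, j] = w[j, i] = cost

add_constraint(0, 1, +1, cost=2.0)   # must-link
add_constraint(2, 3, +1, cost=1.0)   # must-link
add_constraint(0, 4, -1, cost=1.5)   # cannot-link

def violation_cost(labels):
    """Total cost of constraints violated by a cluster assignment: a
    must-link pair split across clusters, or a cannot-link pair placed
    in the same cluster, pays its cost w[i, j]."""
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if c[i, j] == 1 and labels[i] != labels[j]:
                cost += w[i, j]
            elif c[i, j] == -1 and labels[i] == labels[j]:
                cost += w[i, j]
    return cost
```

An assignment respecting all three constraints (e.g., labels [0, 0, 1, 1, 2]) incurs zero violation cost, and this cost term is what the HMRF objective trades off against the distance-based clustering cost.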
The joint probability of X, Y, and Θ, given C, in the described HMRF