Constrained Partitional Clustering of Text Data: An Overview - Text Mining: Classification, Clustering, and Applications

7.7.5 Experimental Results

We performed experiments on text datasets with HMRF-KMeans, a combined constraint-based and distance-based algorithm, to study the effectiveness of each component of the algorithm. HMRF-KMeans was compared with three ablations, as well as with unsupervised KMeans clustering. The following variants were compared:

• KMeans-C-D-R is the complete HMRF-KMeans algorithm that incorporates constraints in cluster assignments (C), includes distance learning (D), and also performs weight regularization (R) using a Rayleigh prior;

• KMeans-C-D is the first ablation of HMRF-KMeans that includes all components except for regularization of distance measure parameters;

• KMeans-C is an ablation of HMRF-KMeans that uses pairwise supervision for initialization and cluster assignments, but does not perform distance measure learning. This is equivalent to the PKM algorithm mentioned in Section 7.4;

• KMeans is the unsupervised K-Means algorithm.
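
The four variants differ only in which of the three components (C, D, R) are switched on. As a reading aid, the list above can be written down as a small table of flags. This is a minimal sketch: the class `Variant`, the dictionary `VARIANTS`, and the helper `removed_components` are hypothetical names introduced here for illustration and are not part of the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Which HMRF-KMeans components a variant uses (illustrative only)."""
    use_constraints: bool      # (C) constraints in initialization and assignment
    learn_distance: bool       # (D) distance measure learning
    regularize_weights: bool   # (R) regularization of distance parameters


# Flags transcribed from the variant descriptions in the text.
VARIANTS = {
    "KMeans-C-D-R": Variant(True, True, True),
    "KMeans-C-D": Variant(True, True, False),
    "KMeans-C": Variant(True, False, False),
    "KMeans": Variant(False, False, False),
}


def removed_components(name):
    """Return the component letters a variant drops relative to the full algorithm."""
    v = VARIANTS[name]
    flags = (("C", v.use_constraints),
             ("D", v.learn_distance),
             ("R", v.regularize_weights))
    return [letter for letter, enabled in flags if not enabled]


if __name__ == "__main__":
    for name in VARIANTS:
        print(name, "drops", removed_components(name) or "nothing")
```

Laying the variants out this way makes the ablation structure explicit: each step down the list removes exactly one more component.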

The goal of these experiments was to evaluate the utility of each component of the HMRF framework and identify settings in which particular components are beneficial. Figures 7.12, 7.13, and 7.14 present the results for the ablation experiments where weighted cosine distance was used as the distance measure.
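
For readers unfamiliar with the distance measure, the sketch below shows one common way to parameterize cosine distance with per-attribute weights. It assumes a diagonal weight matrix, represented here as a list of non-negative weights; the function name `weighted_cosine_distance` and the convention of returning 1.0 for a zero-norm vector are assumptions made for this illustration, not details given in this section.

```python
import math


def weighted_cosine_distance(x, y, weights):
    """1 minus the cosine similarity of x and y under per-attribute weights.

    Assumes `weights` holds one non-negative weight per attribute
    (a diagonal parameterization, which is an assumption of this sketch).
    """
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    if norm_x == 0.0 or norm_y == 0.0:
        # Convention chosen here: a zero-norm vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


if __name__ == "__main__":
    x, y = [1.0, 1.0], [1.0, 0.0]
    # Uniform weights reduce to ordinary cosine distance.
    print(weighted_cosine_distance(x, y, [1.0, 1.0]))
    # Zeroing the second attribute's weight makes the two vectors coincide.
    print(weighted_cosine_distance(x, y, [1.0, 0.0]))
```

With uniform weights this is ordinary cosine distance; changing the weights changes how much each attribute contributes, which is what distance learning adjusts.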

As the results demonstrate, the full HMRF-KMeans algorithm with regularization (KMeans-C-D-R) outperforms the unsupervised K-Means baseline as well as the ablated versions of the algorithm. As can be seen from results for zero pairwise constraints in the figures, distance measure learning is beneficial even in the absence of any pairwise constraints, since it allows capturing the relative importance of the different attributes in the unsupervised data. In the absence of supervised data or when no constraints are violated, distance learning attempts to minimize the objective function by adjusting the weights given the distortion between the unsupervised data instances and their corresponding cluster representatives.

For these datasets, regularization is clearly beneficial to performance, as can be seen from the improved performance of KMeans-C-D-R over KMeans-C-D on all datasets. This can be explained by the fact that the number of distance measure parameters is large for high-dimensional datasets, and therefore algorithm-based estimates of parameters tend to be unreliable unless they incorporate a prior.
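
To make the role of the prior concrete, the sketch below computes the negative log-density of independent Rayleigh priors over a set of positive weights, using the standard Rayleigh density p(a) = (a / s^2) exp(-a^2 / (2 s^2)). How this term enters the HMRF-KMeans objective is not spelled out in this section, so the function name `rayleigh_neg_log_prior`, the shared scale `s`, and the independence across weights are all assumptions of this illustration.

```python
import math


def rayleigh_neg_log_prior(weights, s):
    """Sum of negative log Rayleigh densities over positive weights.

    For one weight a with scale s the term is
        -log(a) + 2*log(s) + a**2 / (2 * s**2).
    A shared scale and independent weights are assumptions of this sketch.
    """
    if s <= 0:
        raise ValueError("scale s must be positive")
    total = 0.0
    for a in weights:
        if a <= 0:
            raise ValueError("Rayleigh density requires positive weights")
        total += -math.log(a) + 2.0 * math.log(s) + a * a / (2.0 * s * s)
    return total


if __name__ == "__main__":
    # The penalty is smallest when a weight equals the scale s (the Rayleigh mode)
    # and grows as the weight moves toward zero or becomes very large.
    for a in (0.5, 1.0, 2.0):
        print(a, rayleigh_neg_log_prior([a], s=1.0))
```

A penalty of this shape discourages weights from collapsing to zero or growing without bound, which is one way a prior can stabilize estimates when there are many parameters and little data per parameter.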

Overall, these results show that the HMRF-KMeans algorithm effectively incorporates constraints for both distance learning and constraint satisfaction, each of which improves the quality of clustering for the text datasets considered in the experiments.
