7.7.5 Experimental Results
We performed experiments on text datasets with HMRF-KMeans, a combined
constraint-based and distance-based algorithm, to study the effectiveness
of each component of the algorithm. HMRF-KMeans was compared
with three ablations, as well as with unsupervised KMeans clustering. The
following variants were compared:
KMeans-C-D-R is the complete HMRF-KMeans algorithm that in-
corporates constraints in cluster assignments (C), includes distance
learning (D), and also performs weight regularization (R) using a Rayleigh
prior;
KMeans-C-D is the first ablation of HMRF-KMeans that includes all
components except for regularization of distance measure parameters;
KMeans-C is an ablation of HMRF-KMeans that uses pairwise super-
vision for initialization and cluster assignments, but does not perform
distance measure learning. This is equivalent to the PKM algorithm
mentioned in Section 7.4.
KMeans is the unsupervised K-Means algorithm.
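The four variants differ only in which components are switched on, so the ablation design can be summarized as a small configuration table. The sketch below is illustrative only: the class name `Variant`, its field names, and the helper `components` are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Which HMRF-KMeans components a variant enables (hypothetical names)."""
    name: str
    constraints: bool        # C: pairwise constraints in initialization and assignment
    distance_learning: bool  # D: learn the distance measure parameters
    regularization: bool     # R: Rayleigh prior on the learned weights


# The four systems compared in the ablation study, as listed above.
VARIANTS = [
    Variant("KMeans-C-D-R", True, True, True),
    Variant("KMeans-C-D", True, True, False),
    Variant("KMeans-C", True, False, False),
    Variant("KMeans", False, False, False),
]


def components(variant):
    """Return the component tags (C, D, R) that a variant has enabled."""
    flags = [
        ("C", variant.constraints),
        ("D", variant.distance_learning),
        ("R", variant.regularization),
    ]
    return [tag for tag, enabled in flags if enabled]
```

Each variant name is simply "KMeans" followed by the tags of its enabled components, which makes it easy to check that every step of the ablation removes exactly one component.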
The goal of these experiments was to evaluate the utility of each component
of the HMRF framework and identify settings in which particular components
are beneficial. Figures 7.12, 7.13, and 7.14 present the results for the ablation
experiments where weighted cosine distance was used as the distance measure.
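As a reminder of what the distance measure computes, the following is a minimal sketch of a weighted cosine distance, assuming a diagonal weight matrix A = diag(a) with non-negative entries, so that the distance is 1 - (x^T A y) / (||x||_A ||y||_A). The function name and argument names are assumptions made for illustration.

```python
import numpy as np


def weighted_cosine_distance(x, y, a):
    """Return 1 - (x^T A y) / (||x||_A ||y||_A) for A = diag(a), a >= 0."""
    x, y, a = (np.asarray(v, dtype=float) for v in (x, y, a))
    numerator = np.sum(a * x * y)
    denominator = np.sqrt(np.sum(a * x * x)) * np.sqrt(np.sum(a * y * y))
    if denominator == 0.0:
        raise ValueError("a vector has zero norm under the given weights")
    return 1.0 - numerator / denominator
```

With all weights equal to one this reduces to the ordinary cosine distance. Setting a weight to zero removes that attribute from the comparison, which is how learned weights can express the relative importance of attributes.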
As the results demonstrate, the full HMRF-KMeans algorithm with reg-
ularization (KMeans-C-D-R) outperforms the unsupervised K-Means base-
line as well as the ablated versions of the algorithm. As can be seen from
results for zero pairwise constraints in the figures, distance measure learning
is beneficial even in the absence of any pairwise constraints, since it allows cap-
turing the relative importance of the different attributes in the unsupervised
data. In the absence of supervised data or when no constraints are violated,
distance learning attempts to minimize the objective function by adjusting
the weights according to the distortion between the unsupervised data instances
and their corresponding cluster representatives.
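The constraint-free case can be sketched as a weight update that looks only at the distortion between instances and their cluster representatives. The code below is an assumption-laden illustration, not the authors' update rule: it uses a finite-difference gradient in place of an analytic one, a diagonal weight vector `a`, a positive floor on the weights, and step halving so that the distortion never increases. All function and parameter names are hypothetical.

```python
import numpy as np


def distortion(X, centroids, labels, a):
    """Sum of weighted cosine distances from each point to its cluster centroid."""
    M = centroids[labels]  # centroid assigned to each row of X
    num = (X * M * a).sum(axis=1)
    den = np.sqrt((X * X * a).sum(axis=1) * (M * M * a).sum(axis=1))
    return float(np.sum(1.0 - num / den))


def update_weights(X, centroids, labels, a, lr=0.1, eps=1e-6, floor=1e-3):
    """Take one descent step on the weights using only the distortion term."""
    a = np.asarray(a, dtype=float)
    base = distortion(X, centroids, labels, a)
    # Finite-difference estimate of the gradient with respect to each weight.
    grad = np.zeros_like(a)
    for m in range(a.size):
        bumped = a.copy()
        bumped[m] += eps
        grad[m] = (distortion(X, centroids, labels, bumped) - base) / eps
    # Halve the step until the distortion does not get worse.
    step = lr
    while step > 1e-12:
        candidate = np.maximum(a - step * grad, floor)  # keep weights positive
        if distortion(X, centroids, labels, candidate) <= base:
            return candidate
        step /= 2.0
    return a  # no improving step found; keep the current weights
```

A possible call, with made-up data, is `update_weights(X, centroids, labels, np.ones(X.shape[1]))`, where `labels` holds the current cluster index of each row of `X`. Because nothing here restrains the weights, repeated steps can concentrate them on a few attributes, which is the situation the regularized variant is meant to address.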
For these datasets, regularization is clearly beneficial to performance, as can
be seen from the improved performance of KMeans-C-D-R over KMeans-
C-D on all datasets. This can be explained by the fact that the number
of distance measure parameters is large for high-dimensional datasets, and
therefore algorithm-based estimates of parameters tend to be unreliable unless
they incorporate a prior.
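To make the role of the prior concrete, the sketch below computes the penalty that a Rayleigh prior adds to the objective, assuming independent weights with density p(a) = (a / s^2) exp(-a^2 / (2 s^2)) for a > 0. The width parameter `s` and the function name are assumptions for illustration; the exact parameterization used in the experiments is not stated in this section.

```python
import numpy as np


def rayleigh_neg_log_prior(a, s=1.0):
    """Return -sum_m log p(a_m) for a Rayleigh density with width s."""
    a = np.asarray(a, dtype=float)
    if np.any(a <= 0):
        raise ValueError("Rayleigh prior is defined for positive weights only")
    # -log p(a) = a^2 / (2 s^2) - log(a) + 2 log(s)
    return float(np.sum(a**2 / (2.0 * s**2) - np.log(a) + 2.0 * np.log(s)))
```

The penalty grows both as a weight approaches zero and as it becomes very large, and for a single weight it is smallest at a = s. Adding it to the distortion therefore pulls poorly supported weights toward a moderate value instead of letting them drift to extremes.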
Overall, these results show that the HMRF-KMeans algorithm effectively
incorporates constraints for doing both distance learning and constraint satis-
faction, each of which improves the quality of clustering for the text datasets
considered in the experiments.