the data: it measures how closely the clustering algorithm could reconstruct
the underlying label distribution in the data (45). Therefore, the higher the
NMI the better. If X is the random variable denoting the cluster assignments
of the instances and Y is the random variable denoting the underlying class
labels on the instances, then the NMI measure is defined as:

\mathrm{NMI} = \frac{I(X;Y)}{\left( H(X) + H(Y) \right)/2}    (7.21)

where I(X;Y) = H(X) - H(X|Y) is the mutual information between the
random variables X and Y, H(X) is the Shannon entropy of X, and H(X|Y)
is the conditional entropy of X given Y (13). NMI effectively measures the
amount of statistical information shared by the random variables representing
the cluster assignments and the user-labeled class assignments of the data
instances. Though various clustering evaluation measures have been used
in the literature, NMI and its variants have become popular lately among
clustering practitioners (22; 23; 37).
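To make Equation (7.21) concrete, here is a minimal Python sketch of the NMI computation (the function names and the toy labelings are illustrative assumptions, not taken from the chapter). It uses the identity I(X;Y) = H(X) + H(Y) - H(X,Y), which is equivalent to the H(X) - H(X|Y) form given above.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a discrete labeling."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(clusters, classes):
    """NMI = I(X;Y) / ((H(X) + H(Y)) / 2), as in Eq. (7.21),
    where X is the cluster assignment and Y is the class label."""
    h_x = entropy(list(clusters))
    h_y = entropy(list(classes))
    h_xy = entropy(list(zip(clusters, classes)))   # joint entropy H(X, Y)
    i_xy = h_x + h_y - h_xy                        # mutual information I(X; Y)
    denom = (h_x + h_y) / 2.0
    return i_xy / denom if denom > 0 else 0.0

# Toy example: cluster assignments versus underlying class labels.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes  = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'b']
print(nmi(clusters, classes))   # approaches 1 as clusters match classes
```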
7.7.3 Methodology
Learning curves were generated using two-fold cross-validation performed
over 20 runs on each dataset. In every trial, 50% of the dataset was set
aside as the training fold. Every point on the learning curve corresponds to
the number of constraints on pairs of data instances from the training fold.
These constraints are obtained by randomly selecting pairs of instances from
the training fold and creating must-link or cannot-link constraints depending
on whether the underlying classes of the two instances are the same or different.
Unit constraint costs W were used for all constraints (original and inferred),
since the datasets did not provide individual weights for the constraints. The
clustering algorithm was run on the whole dataset, but NMI was calculated
using instances in the test fold.
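As an illustration of this constraint-generation step, the following Python sketch samples pairwise constraints from a labeled training fold. The function name, the toy labels, and the handling of the unit weight W are assumptions made for illustration rather than the chapter's actual experimental code.

```python
import random

def sample_constraints(train_indices, labels, num_constraints, seed=0):
    """Randomly pick pairs from the training fold and create a must-link
    constraint when the two instances share a class, or a cannot-link
    constraint otherwise. Each constraint gets unit cost W = 1."""
    rng = random.Random(seed)
    must_link, cannot_link = [], []
    while len(must_link) + len(cannot_link) < num_constraints:
        i, j = rng.sample(train_indices, 2)
        if labels[i] == labels[j]:
            must_link.append((i, j, 1.0))      # (instance, instance, weight W)
        else:
            cannot_link.append((i, j, 1.0))
    return must_link, cannot_link

# Toy 50% training fold of a labeled dataset.
labels = ['a', 'a', 'b', 'b', 'c', 'c']
train_fold = [0, 2, 3, 5]
ml, cl = sample_constraints(train_fold, labels, num_constraints=4)
print("must-link:", ml)
print("cannot-link:", cl)
```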
7.7.4 Comparison of Distance Functions
Figure 7.11 shows the results of running constrained PKM clustering on
News-Same-3 and News-Different-3, using both Euclidean and cosine distances.
As the figure shows, the algorithm performs better with cosine distance than
with Euclidean distance, which is consistent with previous research (40).
Euclidean distance can still be used for constrained text clustering if necessary
(e.g., for the algorithms outlined in Section 7.4), provided the text documents
are pre-normalized. However, practitioners recommend cosine distance for
constrained clustering of text datasets in most domains, in which case
algorithms like HMRF-KMeans are more useful.
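To illustrate the distance-function choice discussed above, here is a small Python sketch (a toy example with assumed term-count vectors, not code from the chapter). It shows why raw Euclidean distance is dominated by document length, and why L2-normalizing the documents first makes squared Euclidean distance equivalent, up to a factor of two, to cosine distance.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two document vectors."""
    return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# Two toy term-count vectors: similar direction (topic), very different length.
doc1 = np.array([3.0, 0.0, 1.0])
doc2 = np.array([30.0, 0.0, 8.0])

print(cosine_distance(doc1, doc2))     # small: documents look similar
print(euclidean_distance(doc1, doc2))  # large: dominated by document length

# After L2-normalization, ||a - b||^2 == 2 * (1 - cos(a, b)),
# so Euclidean distance on normalized documents tracks cosine distance.
u1, u2 = doc1 / np.linalg.norm(doc1), doc2 / np.linalg.norm(doc2)
print(euclidean_distance(u1, u2) ** 2, 2 * cosine_distance(u1, u2))
```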