the data: it measures how closely the clustering algorithm could reconstruct
the underlying label distribution in the data (45). Therefore, the higher the
NMI the better. If X is the random variable denoting the cluster assignments
of the instances and Y is the random variable denoting the underlying class
labels on the instances, then the NMI measure is defined as:

\mathrm{NMI} = \frac{I(X;Y)}{\left( H(X) + H(Y) \right)/2}    (7.21)

where I(X;Y) = H(X) - H(X|Y) is the mutual information between the
random variables X and Y, H(X) is the Shannon entropy of X, and H(X|Y)
is the conditional entropy of X given Y (13). NMI effectively measures the
amount of statistical information shared by the random variables representing
the cluster assignments and the user-labeled class assignments of the data
instances. Though various clustering evaluation measures have been used
in the literature, NMI and its variants have become popular lately among
clustering practitioners (22; 23; 37).
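To make Equation (7.21) concrete, here is a minimal Python sketch of the NMI computation (the function names and the toy labelings are illustrative assumptions, not taken from the chapter). It uses the identity I(X;Y) = H(X) + H(Y) - H(X,Y), which is equivalent to the H(X) - H(X|Y) form given above.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a discrete labeling."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(clusters, classes):
    """NMI = I(X;Y) / ((H(X) + H(Y)) / 2), as in Eq. (7.21),
    where X is the cluster assignment and Y is the class label."""
    h_x = entropy(list(clusters))
    h_y = entropy(list(classes))
    h_xy = entropy(list(zip(clusters, classes)))   # joint entropy H(X, Y)
    i_xy = h_x + h_y - h_xy                        # mutual information I(X; Y)
    denom = (h_x + h_y) / 2.0
    return i_xy / denom if denom > 0 else 0.0

# Toy example: cluster assignments versus underlying class labels.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes  = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'b']
print(nmi(clusters, classes))   # approaches 1 as clusters match classes
```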
7.7.3 Methodology
Learning curves were generated using two-fold cross-validation performed
over 20 runs on each dataset. In every trial, 50% of the dataset was set
aside as the training fold. Every point on the learning curve corresponds to
the number of constraints on pairs of data instances from the training fold.
These constraints are obtained by randomly selecting pairs of instances from
the training fold and creating must-link or cannot-link constraints depending
on whether the underlying classes of the two instances are the same or different.
Unit constraint costs W were used for all constraints (original and inferred),
since the datasets did not provide individual weights for the constraints. The
clustering algorithm was run on the whole dataset, but NMI was calculated
using instances in the test fold.
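As an illustration of this constraint-generation step, the following Python sketch samples pairwise constraints from a labeled training fold. The function name, the toy labels, and the handling of the unit weight W are assumptions made for illustration rather than the chapter's actual experimental code.

```python
import random

def sample_constraints(train_indices, labels, num_constraints, seed=0):
    """Randomly pick pairs from the training fold and create a must-link
    constraint when the two instances share a class, or a cannot-link
    constraint otherwise. Each constraint gets unit cost W = 1."""
    rng = random.Random(seed)
    must_link, cannot_link = [], []
    while len(must_link) + len(cannot_link) < num_constraints:
        i, j = rng.sample(train_indices, 2)
        if labels[i] == labels[j]:
            must_link.append((i, j, 1.0))      # (instance, instance, weight W)
        else:
            cannot_link.append((i, j, 1.0))
    return must_link, cannot_link

# Toy 50% training fold of a labeled dataset.
labels = ['a', 'a', 'b', 'b', 'c', 'c']
train_fold = [0, 2, 3, 5]
ml, cl = sample_constraints(train_fold, labels, num_constraints=4)
print("must-link:", ml)
print("cannot-link:", cl)
```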
7.7.4 Comparison of Distance Functions
Figure 7.11 shows the results of running constrained PKM clustering on
News-Same-3 and News-Different-3, using both Euclidean and cosine distances.
As the figure shows, the algorithm performs better with cosine distance than
with Euclidean distance, which is consistent with previous research (40).
Euclidean distance can still be used for constrained text clustering if necessary
(e.g., for the algorithms outlined in Section 7.4), provided the text documents
are pre-normalized. However, practitioners recommend cosine distance for
constrained clustering of text datasets in most domains, in which case
algorithms like HMRF-KMeans are more useful.
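To illustrate the distance-function choice discussed above, here is a small Python sketch (a toy example with assumed term-count vectors, not code from the chapter). It shows why raw Euclidean distance is dominated by document length, and why L2-normalizing the documents first makes squared Euclidean distance equivalent, up to a factor of two, to cosine distance.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two document vectors."""
    return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# Two toy term-count vectors: similar direction (topic), very different length.
doc1 = np.array([3.0, 0.0, 1.0])
doc2 = np.array([30.0, 0.0, 8.0])

print(cosine_distance(doc1, doc2))     # small: documents look similar
print(euclidean_distance(doc1, doc2))  # large: dominated by document length

# After L2-normalization, ||a - b||^2 == 2 * (1 - cos(a, b)),
# so Euclidean distance on normalized documents tracks cosine distance.
u1, u2 = doc1 / np.linalg.norm(doc1), doc2 / np.linalg.norm(doc2)
print(euclidean_distance(u1, u2) ** 2, 2 * cosine_distance(u1, u2))
```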