Constrained Partitional Clustering of Text Data: An Overview - Text Mining: Classification, Clustering, and Applications


time not scale down the difficulty of the tasks. (2) Clustering a small number of

sparse high-dimensional data instances is a likely scenario in realistic applica-

tions. For example, when clustering the search results in a web-search engine

like Vivísimo, typically the number of webpages that are being clustered is in

the order of hundreds. However the dimensionality of the feature space, cor-

responding to the number of unique words in all the webpages, is in the order

of thousands. Moreover, each webpage is sparse, since it contains only a small

number of all the possible words. On such datasets, clustering algorithms can

easily get stuck in local optima: in such cases it has been observed that there

is little reassignment of documents between clusters for most initializations,

which leads to poor clustering quality after convergence of the algorithm (20).

Supervision in the form of pairwise constraints is most beneficial in such cases

and may significantly improve clustering quality.
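
The "few instances, many sparse dimensions" regime described above can be made concrete with a minimal sketch. It assumes a plain whitespace tokenizer and a binary bag-of-words representation, neither of which is taken from the chapter. The function name `term_document_stats` and the toy `pages` list are hypothetical and serve only as an illustration.

```python
def term_document_stats(documents):
    """Return (n_docs, vocabulary_size, sparsity) for a list of text documents.

    Assumption: documents are tokenized on whitespace and lowercased; the
    chapter's own preprocessing (Section 7.3.1) is not reproduced here.
    """
    # One set of distinct terms per document (binary bag of words).
    doc_terms = [set(doc.lower().split()) for doc in documents]
    vocabulary = set().union(*doc_terms) if doc_terms else set()

    n_docs, n_dims = len(doc_terms), len(vocabulary)
    # Number of non-zero cells in the implied document-term matrix.
    nonzero = sum(len(terms) for terms in doc_terms)
    total_cells = n_docs * n_dims
    # Fraction of cells that are zero.
    sparsity = 1.0 - nonzero / total_cells if total_cells else 0.0
    return n_docs, n_dims, sparsity


# Toy stand-in for a handful of search results (hypothetical data).
pages = [
    "jaguar speed top cat",
    "jaguar car dealer price",
    "apple fruit pie recipe",
]
n_docs, n_dims, sparsity = term_document_stats(pages)
print(n_docs, n_dims, round(sparsity, 3))
```

Even in this toy case the vocabulary is larger than the number of documents and most matrix cells are zero; with a few hundred real webpages and thousands of unique words the imbalance is far more pronounced.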

Three datasets were derived from the 20-Newsgroups collection.³ This

collection has messages harvested from 20 different Usenet newsgroups, 1000

messages from each newsgroup. From the original dataset, a reduced dataset

was created by taking a random subsample of 100 documents from each of the

20 newsgroups. Three datasets were created by selecting 3 categories from

the reduced collection. News-Similar-3 consists of 3 newsgroups on similar

topics (comp.graphics, comp.os.ms-windows, comp.windows.x) with significant

overlap between clusters due to cross-posting. News-Related-3 consists of

3 newsgroups on related topics (talk.politics.misc, talk.politics.guns,

and talk.politics.mideast). News-Different-3 consists of articles posted in

3 newsgroups that cover different topics (alt.atheism, rec.sport.baseball,

sci.space) with well-separated clusters. All the text datasets were

preprocessed using the techniques outlined in Section 7.3.1.
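
The construction just described (subsample 100 documents from each newsgroup, then keep 3 categories) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function `build_subset`, the seed handling, and the synthetic `corpus` are assumptions introduced here, and the preprocessing of Section 7.3.1 is omitted.

```python
import random


def build_subset(docs_by_group, groups, per_group=100, seed=0):
    """Subsample `per_group` documents per newsgroup, then keep `groups`.

    docs_by_group: dict mapping newsgroup name -> list of documents.
    groups: names of the categories to keep (3 in the chapter's datasets).
    Returns (texts, labels), where labels index into `groups`.
    """
    rng = random.Random(seed)
    # Step 1: reduced collection, sampled without replacement from every group.
    reduced = {
        name: rng.sample(docs, min(per_group, len(docs)))
        for name, docs in sorted(docs_by_group.items())
    }
    # Step 2: select the requested categories from the reduced collection.
    texts, labels = [], []
    for label, name in enumerate(groups):
        for doc in reduced[name]:
            texts.append(doc)
            labels.append(label)
    return texts, labels


# Synthetic stand-in for the corpus: 20 groups with 1000 messages each.
corpus = {f"group{g}": [f"g{g}-msg{i}" for i in range(1000)] for g in range(20)}
texts, labels = build_subset(corpus, ["group0", "group1", "group2"])
print(len(texts), sorted(set(labels)))
```

With 3 categories and 100 documents per category, each resulting dataset has 300 instances, matching the instance count reported in Table 7.1.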

Table 7.1 summarizes the properties of these datasets.

TABLE 7.1: Text datasets used in experimental evaluation

              News-Different-3   News-Related-3   News-Similar-3
Instances     300                300              300
Dimensions    3251               3225             1864
Classes       3                  3                3

7.7.2 Clustering Evaluation

Normalized mutual information (NMI) was used as the clustering evaluation

measure. NMI is an external clustering validation metric that estimates the

quality of the clustering with respect to a given underlying class labeling of

³ http://www.ai.mit.edu/people/jrennie/20Newsgroups
