time not scale down the difficulty of the tasks. (2) Clustering a small number of
sparse high-dimensional data instances is a likely scenario in realistic applications.
For example, when clustering the search results in a web-search engine
like Vivísimo, the number of webpages being clustered is typically on
the order of hundreds. However, the dimensionality of the feature space,
corresponding to the number of unique words in all the webpages, is on the order
of thousands. Moreover, each webpage is sparse, since it contains only a small
number of the possible words. On such datasets, clustering algorithms can
easily get stuck in local optima: it has been observed that there
is little reassignment of documents between clusters for most initializations,
which leads to poor clustering quality after convergence of the algorithm (20).
Supervision in the form of pairwise constraints is most beneficial in such cases
and may significantly improve clustering quality.
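To make the "few instances, many sparse dimensions" setting concrete, the following is a minimal sketch that builds a bag-of-words representation and measures its density. The helper names (bag_of_words, density) and the four toy documents are assumptions made for illustration; they are not the data or the preprocessing used in the experiments.

```python
def bag_of_words(docs):
    """Return (vocabulary, rows); each row maps a word index to its count."""
    # One feature dimension per unique word across all documents.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for d in docs:
        counts = {}
        for w in d.lower().split():
            counts[index[w]] = counts.get(index[w], 0) + 1
        rows.append(counts)  # only nonzero entries are stored
    return vocab, rows


def density(rows, n_dims):
    """Fraction of nonzero entries in the instances-by-dimensions matrix."""
    nonzero = sum(len(r) for r in rows)
    return nonzero / (len(rows) * n_dims)


# Hypothetical toy "search results"; real result pages are far longer.
docs = [
    "cheap flights to paris",
    "paris hotel deals",
    "python sparse matrix tutorial",
    "jaguar car review",
]
vocab, rows = bag_of_words(docs)
print(len(docs), "instances,", len(vocab), "dimensions")
print("density:", round(density(rows, len(vocab)), 3))
```

Even in this tiny example there are more dimensions than instances and most entries are zero; with hundreds of pages and thousands of unique words the effect is much stronger.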
Three datasets were derived from the 20-Newsgroups collection (see footnote 3). This
collection has messages harvested from 20 different Usenet newsgroups, 1000
messages from each newsgroup. From the original dataset, a reduced dataset
was created by taking a random subsample of 100 documents from each of the
20 newsgroups. Three datasets were created by selecting 3 categories from
the reduced collection. News-Similar-3 consists of 3 newsgroups on similar
topics (comp.graphics, comp.os.ms-windows, comp.windows.x) with significant
overlap between clusters due to cross-posting. News-Related-3 consists of
3 newsgroups on related topics (talk.politics.misc, talk.politics.guns,
and talk.politics.mideast). News-Different-3 consists of articles posted in
3 newsgroups that cover different topics (alt.atheism, rec.sport.baseball,
sci.space) with well-separated clusters. All the text datasets were
pre-processed using the techniques outlined in Section 7.3.1.
Table 7.1 summarizes the properties of these datasets.
TABLE 7.1: Text datasets used in experimental evaluation

              News-Different-3   News-Related-3   News-Similar-3
Instances     300                300              300
Dimensions    3251               3225             1864
Classes       3                  3                3
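The dataset construction described above (subsample 100 documents per newsgroup, then pick 3 newsgroups) can be sketched as follows. This is an illustrative sketch only: the function names, the fixed random seed, and the synthetic stand-in corpus are assumptions, and the actual loading and preprocessing of the 20-Newsgroups messages are not shown.

```python
import random
from collections import defaultdict


def subsample_per_group(docs, per_group, seed=0):
    """docs: list of (newsgroup, text) pairs.
    Returns a dict mapping each newsgroup to `per_group` randomly chosen texts."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    by_group = defaultdict(list)
    for group, text in docs:
        by_group[group].append(text)
    return {g: rng.sample(texts, per_group) for g, texts in by_group.items()}


def select_categories(reduced, groups):
    """Build one labeled dataset from the chosen newsgroups."""
    return [(g, t) for g in groups for t in reduced[g]]


# Synthetic stand-in for the collection: 1000 placeholder messages per group.
GROUPS = [
    "comp.graphics", "comp.os.ms-windows", "comp.windows.x",
    "alt.atheism", "rec.sport.baseball", "sci.space",
]
corpus = [(g, f"{g} message {i}") for g in GROUPS for i in range(1000)]

reduced = subsample_per_group(corpus, per_group=100)
news_similar_3 = select_categories(
    reduced, ["comp.graphics", "comp.os.ms-windows", "comp.windows.x"]
)
news_different_3 = select_categories(
    reduced, ["alt.atheism", "rec.sport.baseball", "sci.space"]
)
print(len(news_similar_3), len(news_different_3))  # 300 300
```

Each resulting dataset has 3 classes of 100 documents, which matches the 300 instances listed in Table 7.1.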
7.7.2 Clustering Evaluation
Normalized mutual information (NMI) was used as the clustering evaluation
measure. NMI is an external clustering validation metric that estimates the
quality of the clustering with respect to a given underlying class labeling of
3 http://www.ai.mit.edu/people/jrennie/20Newsgroups