Yahoo News (K-series). This compilation has 2340 Yahoo news arti-
cles from 20 different categories. The underlying clusters in this dataset
are highly skewed in terms of the number of documents per cluster, with
sizes ranging from 9 to 494. The skewness presents additional challenges
to clustering algorithms.
20 Newsgroup. The 20 Newsgroup dataset is a widely used com-
pilation of documents (28). We tested our algorithms not only on the
original dataset but also on a variety of subsets with differing
characteristics, to explore and understand the behavior of our algorithms.
1. News20 is a standard dataset that comprises 19,997 messages,
gathered from 20 different USENET newsgroups. One thousand
messages are drawn from the first 19 newsgroups, and 997 from
the twentieth. The headers for each of the messages are then re-
moved to avoid biasing the results. The particular vector space
model used had 25924 words. News20 embodies the features char-
acteristic of a typical text dataset: high dimensionality, sparsity,
and significantly overlapping clusters.
2. Small-news20 is formed by selecting 2000 messages from the orig-
inal News20 dataset. We randomly selected 100 messages from each
category in the original dataset. Hence this dataset has balanced
classes (though there may be overlap). The dimensionality of the
data was 13406.
3. Same-100/1000 is a collection of 100/1000 messages from 3
very similar newsgroups:,,
4. Similar-100/1000 is a collection of 100/1000 messages from 3
somewhat similar newsgroups: talk.politics.
5. Different-100/1000 is a collection of 100/1000 messages from
3 very different newsgroups:
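The balanced subsets above (for example Small-news20, with 100 randomly selected messages from each of 20 categories) can be illustrated with a minimal sketch of per-category random sampling. This is not the procedure actually used to build the datasets: the function name sample_per_category, the (category, text) pair layout, the fixed seed, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_category(messages, per_category, seed=0):
    """Draw `per_category` messages uniformly at random from each category.

    `messages` is an iterable of (category, text) pairs; this layout is a
    hypothetical one chosen for the sketch, not the original data format.
    """
    by_category = defaultdict(list)
    for category, text in messages:
        by_category[category].append(text)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for category in sorted(by_category):
        pool = by_category[category]
        if len(pool) < per_category:
            raise ValueError(
                f"category {category!r} has only {len(pool)} messages"
            )
        # Sampling without replacement keeps every selected message distinct.
        subset.extend((category, text) for text in rng.sample(pool, per_category))
    return subset


# Toy stand-in for a 20-category corpus (synthetic, for illustration only).
corpus = [(f"group{g}", f"message {g}-{i}") for g in range(20) for i in range(150)]
small = sample_per_category(corpus, per_category=100)
```

With 20 categories and 100 messages per category, the resulting subset has 2000 messages and perfectly balanced classes, matching the size reported for Small-news20.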
Slashdot. We harvested news articles from the Slashdot website and
created 2 datasets. For each category in these datasets, we collected
1000 articles primarily tagged with the category label, and then removed
articles that were posted to multiple categories.
1. Slash-7 contains 6714 news articles posted to 7 Slashdot cate-
gories: Business, Education, Entertainment, Games, Music, Sci-
ence, and Internet.
2. Slash-6 contains 5182 articles posted to the 6 categories: Biotech,
Microsoft, Privacy, Google, Security, Space.
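The removal of articles posted to multiple categories can be sketched as a simple filter. This is an illustrative sketch only: the dictionary keys "id" and "categories", the function name single_category_articles, and the sample records are hypothetical and are not taken from the original harvesting code. Such a filter is consistent with the reported sizes, which fall below 1000 articles per category (6714 rather than 7000 for Slash-7, and 5182 rather than 6000 for Slash-6).

```python
def single_category_articles(articles):
    """Keep only articles that were posted to exactly one category.

    `articles` is a list of dicts with the hypothetical keys "id" and
    "categories" (a list of category labels).
    """
    return [a for a in articles if len(a["categories"]) == 1]


# Synthetic records for illustration only.
articles = [
    {"id": 1, "categories": ["Games"]},
    {"id": 2, "categories": ["Games", "Entertainment"]},
    {"id": 3, "categories": ["Science"]},
    {"id": 4, "categories": ["Music", "Business", "Internet"]},
]

kept = single_category_articles(articles)
kept_ids = [a["id"] for a in kept]
```

Here only articles 1 and 3 survive, since each of the others carries more than one category label and would therefore have an ambiguous cluster assignment.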