Extracting the TF-IDF features from the 20 Newsgroups dataset
To illustrate the concepts in this chapter, we will use a well-known text dataset called 20
Newsgroups; this dataset is commonly used for text-classification tasks. It is a collection
of newsgroup messages posted across 20 different topics. The dataset is available in
several versions. For our purposes, we will use the bydate version, which is
available at http://qwone.com/~jason/20Newsgroups.
This version splits the available data into training and test sets that comprise 60 percent
and 40 percent of the original data, respectively. The split is by date: the messages in the
test set were posted after those in the training set. This version also excludes some of the
message headers that identify the actual newsgroup; hence, it is an appropriate dataset to
test the real-world performance of classification models.
Note
Further information on the original dataset can be found in the UCI Machine Learning
Repository page at http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html.
To get started, download the data and unzip the file using the following command:
>tar xfvz 20news-bydate.tar.gz
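If the tar command is not available (for example, on some Windows setups), the same extraction can be sketched in Python with the standard tarfile module. This is a minimal sketch rather than part of the original workflow: the function name extract_archive is hypothetical, and it assumes the archive has already been downloaded into the working directory.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the sorted names of the top-level entries in the archive.
    (Hypothetical helper, introduced here for illustration only.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    # Keep only the first path component of each member name.
    return sorted({Path(name).parts[0] for name in names if Path(name).parts})


# Example call, assuming the archive sits in the current directory:
# extract_archive("20news-bydate.tar.gz")
```

Either route unpacks the archive into the current directory.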
This will create two folders: one called 20news-bydate-train and another one called
20news-bydate-test. Let's take a look at the directory structure under the training
dataset folder:
>cd 20news-bydate-train/
>ls
You will see that it contains a number of subfolders, one for each newsgroup:
alt.atheism comp.windows.x
rec.sport.hockey soc.religion.christian
comp.graphics misc.forsale
sci.crypt talk.politics.guns
comp.os.ms-windows.misc rec.autos
sci.electronics talk.politics.mideast
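The folder layout can also be inspected programmatically, which makes it easy to check the roughly 60/40 split described earlier. The following is a minimal sketch under stated assumptions: the helper names count_messages and split_fractions are hypothetical, and it assumes each newsgroup subfolder contains one file per message.

```python
from pathlib import Path


def count_messages(dataset_dir):
    """Map each newsgroup subfolder name to the number of files it holds."""
    root = Path(dataset_dir)
    return {
        group.name: sum(1 for f in group.iterdir() if f.is_file())
        for group in sorted(root.iterdir())
        if group.is_dir()
    }


def split_fractions(train_dir, test_dir):
    """Return (train_fraction, test_fraction) of the total message count."""
    train_total = sum(count_messages(train_dir).values())
    test_total = sum(count_messages(test_dir).values())
    total = train_total + test_total
    if total == 0:
        raise ValueError("no message files found in either folder")
    return train_total / total, test_total / total


# Example call, run from the directory that contains both folders:
# print(count_messages("20news-bydate-train"))
# print(split_fractions("20news-bydate-train", "20news-bydate-test"))
```

Note that the paths in the example calls are relative to the directory where the archive was extracted, not to the training folder itself.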