Advanced Text Processing with Spark - Machine Learning with Spark

Extracting the TF-IDF features from the 20 Newsgroups dataset

To illustrate the concepts in this chapter, we will use a well-known text dataset called 20 Newsgroups, which is commonly used for text-classification tasks. It is a collection of newsgroup messages posted across 20 different topics. Several versions of the data are available. For our purposes, we will use the bydate version of the dataset, which is available at http://qwone.com/~jason/20Newsgroups .

This version splits the available data into training and test sets that comprise 60 percent and 40 percent of the original data, respectively. The split is by date: the messages in the test set were posted after those in the training set. This version also excludes some of the message headers that identify the actual newsgroup; hence, it is an appropriate dataset to test the real-world performance of classification models.

Note

Further information on the original dataset can be found in the UCI Machine Learning Repository.

To get started, download the data and unzip the file using the following command:

>tar xfvz 20news-bydate.tar.gz

This will create two folders: one called 20news-bydate-train and another one called 20news-bydate-test . Let's take a look at the directory structure under the training dataset folder:

>cd 20news-bydate-train/

>ls

You will see that it contains a number of subfolders, one for each newsgroup:

alt.atheism comp.windows.x

rec.sport.hockey soc.religion.christian

comp.graphics misc.forsale

sci.crypt talk.politics.guns

comp.os.ms-windows.misc rec.autos

sci.electronics talk.politics.mideast
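As a quick sanity check on this layout, the following Python sketch counts the files in each newsgroup subfolder. It assumes the archive was extracted into the current working directory and that every file inside a subfolder is a single message. The helper name count_messages is our own illustration and is not part of the dataset or of Spark.

```python
from pathlib import Path


def count_messages(root):
    """Return {newsgroup_name: number_of_files} for each subfolder of root."""
    counts = {}
    for group_dir in sorted(Path(root).iterdir()):
        if group_dir.is_dir():
            # Assumption: each file in a newsgroup folder is one message.
            counts[group_dir.name] = sum(
                1 for f in group_dir.iterdir() if f.is_file()
            )
    return counts


if __name__ == "__main__":
    # Assumed location: the folder created by the tar command above.
    train_root = Path("20news-bydate-train")
    if train_root.is_dir():
        for name, n in count_messages(train_root).items():
            print(f"{name}\t{n}")
    else:
        print(f"{train_root} not found; extract the archive first")
```

Running the same function on 20news-bydate-test gives the per-newsgroup counts for the test set, which makes it easy to check the split between the two folders.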
