Building a Classification Model with Spark - Machine Learning with Spark

Extracting features from the Kaggle/StumbleUpon evergreen classification dataset

In this chapter, we will use a different dataset from the one we used for our recommendation model, as the MovieLens data doesn't have much for us to work with in terms of a classification problem. We will use a dataset from a competition on Kaggle. The dataset was provided by StumbleUpon, and the problem relates to classifying whether a given web page is ephemeral (that is, short-lived and will cease being popular soon) or evergreen (that is, persistently popular) on their web content recommendation pages.

Note

The dataset used here can be downloaded from http://www.kaggle.com/c/stumbleupon/data.

Download the training data (train.tsv); you will need to accept the terms and conditions before downloading the dataset.

You can find more information about the competition at http://www.kaggle.com/c/stumbleupon.

Before we begin, it will be easier for us to work with the data in Spark if we remove the column name header from the first line of the file. Change to the directory in which you downloaded the data (referred to as PATH here) and run the following command to remove the first line and write the result to a new file called train_noheader.tsv:

>sed 1d train.tsv > train_noheader.tsv
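If you want to confirm what the command does before running it on the full download, the following sketch applies it to a tiny made-up file. The file name sample.tsv and its three columns are placeholders chosen for illustration only; they are not the layout of the real dataset.

```shell
# Work in a scratch directory so that no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for train.tsv: one header line plus two records.
printf 'url\turlid\tlabel\n' > sample.tsv
printf '"http://example.com/a"\t"1"\t"0"\n' >> sample.tsv
printf '"http://example.com/b"\t"2"\t"1"\n' >> sample.tsv

# Same sed command as above: delete line 1 and write the rest to a new file.
sed 1d sample.tsv > sample_noheader.tsv

# Compare the line counts, then look at the first remaining line.
wc -l < sample.tsv
wc -l < sample_noheader.tsv
head -n 1 sample_noheader.tsv
```

The second count should be one lower than the first, and the first remaining line should be a data record rather than the column names. The same two checks can be run against train.tsv and train_noheader.tsv.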

Now, we are ready to start up our Spark shell (remember to run this command from your Spark installation directory):

>./bin/spark-shell --driver-memory 4g

You can type in the code that follows for the remainder of this chapter directly into your Spark shell.

In a manner similar to what we did in the earlier chapters, we will load the raw training data into an RDD and inspect it:
