Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
In this chapter, we will use a different dataset from the one we used for our
recommendation model, as the MovieLens data doesn't give us much to work with for a
classification problem. We will use a dataset from a competition on Kaggle. The dataset
was provided by StumbleUpon, and the task is to classify whether a given web page on
their web content recommendation pages is ephemeral (that is, short lived and soon to
stop being popular) or evergreen (that is, persistently popular).
Note
The dataset used here can be downloaded from
http://www.kaggle.com/c/stumbleupon/data.
Download the training data (train.tsv). You will need to accept the terms and
conditions before downloading the dataset.
You can find more information about the competition at
http://www.kaggle.com/c/stumbleupon.
Before we begin, it will be easier for us to work with the data in Spark if we remove the
column name header from the first line of the file. Change to the directory in which you
downloaded the data (referred to as PATH here) and run the following command to remove
the first line and redirect the result to a new file called train_noheader.tsv:
>sed 1d train.tsv > train_noheader.tsv
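As a quick sanity check, the following sketch confirms that the header removal drops exactly one line and that the new file starts with a data row. So that it can run anywhere, it builds a tiny stand-in file; the names sample.tsv and sample_noheader.tsv and the two columns are made up for illustration and are not the real layout of train.tsv. Substitute train.tsv and train_noheader.tsv to check the actual data.

```shell
# Create a tiny stand-in for train.tsv (hypothetical columns and values)
printf 'url\tlabel\nhttp://example.com/a\t1\nhttp://example.com/b\t0\n' > sample.tsv

# Remove the first (header) line, as in the command above
sed 1d sample.tsv > sample_noheader.tsv

# The new file should have exactly one line fewer than the original
orig_lines=$(wc -l < sample.tsv)
new_lines=$(wc -l < sample_noheader.tsv)
echo "original: $orig_lines, without header: $new_lines"

# The first line should now be a data row, not the column names
head -n 1 sample_noheader.tsv
```

If the two counts differ by anything other than one, or the first line still shows column names, the header was not removed as expected.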
Now, we are ready to start up our Spark shell (remember to run this command from your
Spark installation directory):
>./bin/spark-shell --driver-memory 4g
You can type in the code that follows for the remainder of this chapter directly into your
Spark shell.
In a manner similar to what we did in the earlier chapters, we will load the raw training
data into an RDD and inspect it: