// load the raw TSV data and split each line into its tab-separated fields
val rawData = sc.textFile("/PATH/train_noheader.tsv")
val records = rawData.map(line => line.split("\t"))
records.first()
You will see the following on the screen:
Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", ...
You can check the fields that are available by reading through the overview on the dataset
page above. The first two columns contain the URL and ID of the page. The next column
contains some raw textual content. The next column contains the category assigned to the
page. The next 22 columns contain numeric or categorical features of various kinds. The
final column contains the target—1 is evergreen, while 0 is non-evergreen.
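As a quick sanity check (a small sketch, not part of the original walkthrough), we can confirm that each record splits into the expected number of fields; from the description above this should be 2 + 1 + 1 + 22 + 1 = 27:
// verify the expected field count per record (27 columns)
val numFields = records.first().size
println(numFields) // expected: 27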
We'll start off with a simple approach of using only the available numeric features directly. As each categorical variable is binary, we already have a 1-of-k encoding for these variables, so we don't need to do any further feature extraction.
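To see why no further encoding is needed, we can inspect the distinct raw values of one of the binary columns; note that the column index below is an illustrative assumption, not specified in the text, and we should expect only "0", "1", and possibly the missing-value marker "?":
// hedged sketch: assume column index 17 holds one of the binary
// categorical features; list its distinct raw values
records.map(r => r(17).replaceAll("\"", "")).distinct().collect().foreach(println)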
Due to the way the data is formatted, we will have to do a bit of data cleaning during our initial processing by trimming out the extra quotation characters ( " ). There are also missing values in the dataset; they are denoted by the "?" character. In this case, we will simply assign a zero value to these missing values:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = records.map { r =>
  // strip the extra quotation characters from every field
  val trimmed = r.map(_.replaceAll("\"", ""))
  // the label is in the last column
  val label = trimmed(r.size - 1).toInt
  // take the feature columns, replacing missing ("?") values with 0.0
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
In the preceding code, we extracted the label variable from the last column and an array of features for columns 5 to 26 after cleaning and dealing with missing values. We converted the label to an Int value and the features to an Array[Double]. Finally, we wrapped the label and features in a LabeledPoint instance.
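As a minimal follow-up sketch using the standard RDD API, we can cache the parsed dataset and trigger the computation, which verifies that every record parses cleanly end to end:
// cache the parsed LabeledPoints so later passes don't re-parse the text file
data.cache()
// force evaluation; this will fail fast if any record can't be parsed
val numData = data.count()
println(numData)
// inspect the first parsed record
println(data.first())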