// load the raw TSV data and split each line into its tab-separated fields
val rawData = sc.textFile("/PATH/train_noheader.tsv")
val records = rawData.map(line => line.split("\t"))
records.first()
You will see the following on the screen:
Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", ...
You can check the fields that are available by reading through the overview on the dataset
page above. The first two columns contain the URL and ID of the page. The next column
contains some raw textual content. The next column contains the category assigned to the
page. The next 22 columns contain numeric or categorical features of various kinds. The
final column contains the target—1 is evergreen, while 0 is non-evergreen.
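As a quick sanity check (a small sketch, not part of the original walkthrough), we can confirm that each record splits into the expected number of fields; from the description above this should be 2 + 1 + 1 + 22 + 1 = 27:
// verify the expected field count per record (27 columns)
val numFields = records.first().size
println(numFields) // expected: 27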
We'll start off with a simple approach of using only the available numeric features directly. As each categorical variable is binary, we already have a 1-of-k encoding for these variables, so we don't need to do any further feature extraction.
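To see why no further encoding is needed, we can inspect the distinct raw values of one of the binary columns; note that the column index below is an illustrative assumption, not specified in the text, and we should expect only "0", "1", and possibly the missing-value marker "?":
// hedged sketch: assume column index 17 holds one of the binary
// categorical features; list its distinct raw values
records.map(r => r(17).replaceAll("\"", "")).distinct().collect().foreach(println)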
Due to the way the data is formatted, we will have to do a bit of data cleaning during our initial processing by trimming out the extra quotation characters ( " ). There are also missing values in the dataset; they are denoted by the "?" character. In this case, we will simply assign a zero value to these missing values:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = records.map { r =>
  // strip the extra quotation characters from every field
  val trimmed = r.map(_.replaceAll("\"", ""))
  // the label is in the last column
  val label = trimmed(r.size - 1).toInt
  // take the feature columns, replacing missing ("?") values with 0.0
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
In the preceding code, we extracted the label variable from the last column and an array of features for columns 5 to 26 after cleaning and dealing with missing values. We converted the label to an Int value and the features to an Array[Double]. Finally, we wrapped the label and features in a LabeledPoint instance.
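As a minimal follow-up sketch using the standard RDD API, we can cache the parsed dataset and trigger the computation, which verifies that every record parses cleanly end to end:
// cache the parsed LabeledPoints so later passes don't re-parse the text file
data.cache()
// force evaluation; this will fail fast if any record can't be parsed
val numData = data.count()
println(numData)
// inspect the first parsed record
println(data.first())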