Building a Classification Model with Spark - Machine Learning with Spark


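  // The label is the last column of the record
  val label = trimmed(r.size - 1).toInt
  // Look up the index of this record's category (column 3)
  val categoryIdx = categories(r(3))
  // 1-of-k encoding: all zeros except a 1.0 at the category index
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  // Remaining numeric columns; missing values ("?") are replaced with 0.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d =>
    if (d == "?") 0.0 else d.toDouble)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}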

println(dataCategories.first)

You should see output similar to what is shown here. You can see that the first part of our feature vector is now a vector of length 14 with one nonzero entry at the relevant category index:

LabeledPoint(0.0, [0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
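The 1-of-k encoding that produces this leading block of zeros and a single 1.0 can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the category names, `rawCategories`, `categoryIndex`, `oneHot`, and `encode` are hypothetical and do not come from the dataset or the original code.

```scala
// Hypothetical category values; in the text, the categories come from
// column 3 of each record.
val rawCategories = Seq("business", "recreation", "health", "business", "sports")

// Map each distinct category to an index, in the same spirit as the
// categories map used in the text.
val categoryIndex: Map[String, Int] = rawCategories.distinct.zipWithIndex.toMap
val numCats = categoryIndex.size

// Build a 1-of-k vector: all zeros except a 1.0 at the category's index.
def oneHot(category: String): Array[Double] = {
  val v = Array.ofDim[Double](numCats)
  v(categoryIndex(category)) = 1.0
  v
}

// Prepend the encoded category to the numeric features.
def encode(category: String, numeric: Array[Double]): Array[Double] =
  oneHot(category) ++ numeric

println(encode("recreation", Array(0.789131, 2.055555556)).mkString("[", ",", "]"))
```

With four distinct categories in this toy example, every encoded vector starts with four entries that sum to 1.0, followed by the numeric features unchanged.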

Again, since our raw features are not standardized, we should standardize them using the same StandardScaler approach that we used earlier, before training a new model on this expanded dataset:

val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
val scaledDataCats = dataCategories.map(lp =>
  LabeledPoint(lp.label, scalerCats.transform(lp.features)))
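To make the effect of withMean = true and withStd = true concrete, here is a minimal plain-Scala sketch of column-wise standardization. It is not MLlib's implementation: the `standardize` helper and the toy data are hypothetical, the n - 1 denominator is chosen on the assumption that the scaler uses the corrected sample standard deviation, and mapping zero-variance columns to 0.0 is a choice made for this sketch.

```scala
// Standardize each column: subtract the column mean and divide by the
// sample standard deviation (n - 1 denominator). Assumes at least two rows.
def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val n = rows.size
  val dims = rows.head.length
  val means = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
  val stds = Array.tabulate(dims) { j =>
    math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / (n - 1))
  }
  rows.map { r =>
    // A zero-variance column is mapped to 0.0 to avoid dividing by zero.
    Array.tabulate(dims)(j =>
      if (stds(j) == 0.0) 0.0 else (r(j) - means(j)) / stds(j))
  }
}

// Hypothetical toy data: two features on very different scales.
val toy = Seq(Array(1.0, 100.0), Array(2.0, 300.0), Array(3.0, 500.0))
val scaled = standardize(toy)
scaled.foreach(r => println(r.mkString("[", ",", "]")))
```

After scaling, both columns of the toy data have zero mean and unit sample standard deviation, even though their raw scales differ by two orders of magnitude. The same effect applies to the one-hot columns and the numeric columns of the expanded feature vector.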

We can inspect the features before and after scaling as we did earlier:

println(dataCategories.first.features)

The output is as follows:

0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556

...

The following code will print the features after scaling:

println(scaledDataCats.first.features)
