the label and features in a LabeledPoint instance, converting the features into an
MLlib Vector.
We will also cache the data and count the number of data points:
data.cache
val numData = data.count
You will see that the value of numData is 7395.
We will explore the dataset in more detail a little later, but note for now that the
numeric data contains some negative feature values. As we saw earlier, the naïve Bayes
model requires non-negative features and will throw an error if it encounters negative
values. So, for now, we will create a version of our input feature vectors for the
naïve Bayes model by setting any negative feature values to zero:
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
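To check the per-record logic without a Spark context, the same steps can be sketched on a plain Scala array. This is only an illustrative sketch: the object name ClampSketch, the helpers parseNonNegative and toLabelAndFeatures, and the sample record are hypothetical and not part of the original listing. It assumes, as above, that the features occupy columns 4 up to (but not including) the last column, and that the last column is the label.

```scala
object ClampSketch {
  // Hypothetical helper: parse one raw field, treating "?" as a missing
  // value (0.0) and replacing any negative value with 0.0.
  def parseNonNegative(field: String): Double = {
    val value = if (field == "?") 0.0 else field.toDouble
    if (value < 0) 0.0 else value
  }

  // Hypothetical helper: strip the quotes from one record, then return the
  // label (last column) and the non-negative features (column 4 up to,
  // but excluding, the last column).
  def toLabelAndFeatures(record: Array[String]): (Int, Array[Double]) = {
    val trimmed = record.map(_.replaceAll("\"", ""))
    val label = trimmed(trimmed.length - 1).toInt
    val features = trimmed.slice(4, trimmed.length - 1).map(parseNonNegative)
    (label, features)
  }

  // Made-up record for illustration only; it is not taken from the dataset.
  val sample: Array[String] =
    Array("\"a\"", "\"b\"", "\"c\"", "\"d\"", "\"0.5\"", "\"-1.0\"", "\"?\"", "\"1\"")
}
```

Calling ClampSketch.toLabelAndFeatures(ClampSketch.sample) yields the label 1 and the features 0.5, 0.0, 0.0: the negative value and the missing value are both mapped to zero. Wrapping each result in LabeledPoint(label, Vectors.dense(features)) would then match the record-level transformation shown above.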