Building a Classification Model with Spark - Machine Learning with Spark


Additional features

We have seen that we need to be careful about standardizing and potentially normalizing our features, and that the impact on model performance can be serious. In this case, we used only a portion of the features available. For example, we completely ignored the category variable and the textual content in the boilerplate column.

This was done for ease of illustration, but let's assess the impact of adding a further feature, such as the category feature.

First, we will inspect the categories and form a mapping of category to index, which you might recognize as the basis for a 1-of-k encoding of this categorical feature:

val categories = records.map(r => r(3)).distinct.collect.zipWithIndex.toMap
val numCategories = categories.size
println(categories)

The output of the different categories is as follows:

Map("weather" -> 0, "sports" -> 6, "unknown" -> 4, "computer_internet" -> 12,
"?" -> 11, "culture_politics" -> 3, "religion" -> 8, "recreation" -> 2,
"arts_entertainment" -> 9, "health" -> 5, "law_crime" -> 10, "gaming" -> 13,
"business" -> 1, "science_technology" -> 7)

The following code will print the number of categories:

println(numCategories)

Here is the output:

14

So, we will need to create a vector of length 14 to represent this feature and, for each data point, set the entry at the index of the relevant category to 1. We can then prepend this new feature vector to the vector of other numerical features:

val dataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
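
The 1-of-k encoding described above can be sketched on plain Scala collections, without a Spark context. This is a minimal illustration, not the book's listing: the object name CategoryEncoding, its helper functions, and the sample records in the usage note are hypothetical. It assumes that the category sits in column 3, that the numeric features start at column 4, and that a "?" value stands for a missing number to be replaced with 0.0.

```scala
object CategoryEncoding {
  // Map each distinct category to an index (first occurrence gets index 0)
  def categoryIndex(cats: Seq[String]): Map[String, Int] =
    cats.distinct.zipWithIndex.toMap

  // 1-of-k vector: all zeros except a 1.0 at the index of the given category
  def oneOfK(category: String, index: Map[String, Int]): Array[Double] = {
    val v = Array.fill(index.size)(0.0)
    v(index(category)) = 1.0
    v
  }

  // Prepend the 1-of-k category vector to the remaining numeric features
  def encode(records: Seq[Array[String]]): Seq[Array[Double]] = {
    // Strip the surrounding quotes from every field
    val trimmed = records.map(_.map(_.replaceAll("\"", "")))
    val index = categoryIndex(trimmed.map(r => r(3)))
    trimmed.map { r =>
      // Assumption: "?" marks a missing value and is replaced with 0.0
      val otherFeatures = r.drop(4).map(d => if (d == "?") 0.0 else d.toDouble)
      oneOfK(r(3), index) ++ otherFeatures
    }
  }
}
```

As a usage example, calling CategoryEncoding.encode on two made-up records whose categories are "sports" and "business" yields vectors of length 2 plus the number of numeric columns, with the 1.0 in position 0 for the first record and in position 1 for the second. On an RDD, the same per-record logic would sit inside records.map, with the category map built once beforehand as in the listing above.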
