Additional features
We have seen that standardizing, and possibly normalizing, our features requires care, and
that it can have a serious impact on model performance. In this case, we used only a portion
of the features available. For example, we completely ignored the category variable and the
textual content in the boilerplate column.
This was done for ease of illustration, but let's assess the impact of adding another
feature, such as the category feature.
First, we will inspect the categories and form a mapping of category to index, which you
might recognize as the basis for a 1-of-k encoding of this categorical feature:
val categories = records.map(r => r(3)).distinct.collect.zipWithIndex.toMap
val numCategories = categories.size
println(categories)
The output, showing each category and its assigned index, is as follows:
Map("weather" -> 0, "sports" -> 6, "unknown" -> 4,
"computer_internet" -> 12, "?" -> 11, "culture_politics" ->
3, "religion" -> 8, "recreation" -> 2, "arts_entertainment"
-> 9, "health" -> 5, "law_crime" -> 10, "gaming" -> 13,
"business" -> 1, "science_technology" -> 7)
The following code will print the number of categories:
println(numCategories)
Here is the output:
14
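As a minimal sketch of what the distinct, zipWithIndex, and toMap chain does, the same steps can be run on a local Scala collection. The sample values and the name categoryIndex are hypothetical, and the sorted call is an added assumption rather than part of the original code: it makes the index assignment reproducible, whereas the order returned by distinct on an RDD may differ between runs or cluster configurations.

```scala
// Hypothetical sample of category values; in the text these come from r(3) of each record.
val sample = Seq("business", "sports", "?", "business", "health", "sports")

// distinct removes duplicates, zipWithIndex pairs each value with its position,
// and toMap turns the (category, index) pairs into a lookup table.
// sorted is an added assumption so that the same input always yields the same indices.
val categoryIndex: Map[String, Int] = sample.distinct.sorted.zipWithIndex.toMap

println(categoryIndex)       // the category -> index mapping
println(categoryIndex.size)  // the number of distinct categories
```

With a fixed ordering, the index of a given category does not depend on how the data happens to be partitioned, which matters if the mapping is rebuilt later and must match an already trained model.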
So, we will need to create a vector of length 14 to represent this feature and assign a value
of 1 for the index of the relevant category for each data point. We can then prepend this
new feature vector to the vector of other numerical features:
val dataCategories = records.map { r =>
val trimmed = r.map(_.replaceAll("\"", ""))
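The encoding step described above, a vector with a single 1 at the category's index followed by the other numerical features, can be sketched in plain Scala without Spark. This is an illustration under stated assumptions, not the listing from the text: toyIndex, oneHot, and withCategory are hypothetical names, and the three-category mapping and the numeric values are made up for the example.

```scala
// Hypothetical category -> index mapping; a stand-in for the `categories` map in the text.
val toyIndex = Map("business" -> 0, "health" -> 1, "sports" -> 2)

// Build a 1-of-k vector: all zeros except a 1.0 at the index of the given category.
def oneHot(category: String, index: Map[String, Int]): Array[Double] = {
  val v = Array.ofDim[Double](index.size)
  v(index(category)) = 1.0 // throws NoSuchElementException for a category not in the map
  v
}

// Prepend the encoded category to the other numerical features.
def withCategory(category: String,
                 numeric: Array[Double],
                 index: Map[String, Int]): Array[Double] =
  oneHot(category, index) ++ numeric

// Hypothetical data point: category "health" plus two numeric features.
val features = withCategory("health", Array(0.5, 2.0), toyIndex)
println(features.mkString("[", ", ", "]")) // prints [0.0, 1.0, 0.0, 0.5, 2.0]
```

The resulting vector has length numCategories plus the number of numeric features, so with the 14 categories found above, each data point gains 14 extra dimensions.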