are in the same class or you end up with no features left. In this case,
you take the majority vote.
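To make "majority vote" concrete, here is a minimal R sketch (the leaf labels are made up for illustration): the predicted class at a leaf is simply the most common outcome among the training points that land there.

# Majority vote at a leaf: predict the most common class among
# the training points that reach this leaf (labels are hypothetical).
leaf_labels <- c("return", "return", "churn", "return")
names(which.max(table(leaf_labels)))   # "return"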
Often people “prune the tree” afterwards to avoid overfitting. This just
means cutting it off below a certain depth. After all, by design, the
algorithm gets weaker and weaker as you build the tree, and it's well
known that if you build the entire tree, it's often less accurate (with
new data) than if you prune it.
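As a hedged sketch of what pruning looks like in practice with rpart (the library used later in this section; the fit here is on R's built-in iris data purely for illustration): printcp() shows the cross-validated error at each value of the complexity parameter cp, and prune() cuts the tree back to a chosen cp.

# Fit a full tree, then prune it back to the cp value with the
# lowest cross-validated error (xerror) in the cp table.
library(rpart)
fit <- rpart(Species ~ ., method = "class", data = iris)
printcp(fit)   # cross-validated error at each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)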
The decision tree algorithm is an example of an embedded feature selection algorithm. (Why embedded?) You don't need to use a filter here because the information gain method is doing your feature selection for you.
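To see what the information gain method is computing under the hood, here is a minimal sketch (not from the text; entropy and info_gain are illustrative helper names): the gain of a feature is the entropy of the outcome before the split minus the weighted entropy after it.

# Entropy of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting outcome y on a binary feature x.
info_gain <- function(y, x) {
  n <- length(y)
  groups <- split(y, x)
  entropy(y) - sum(sapply(groups, function(g) length(g) / n * entropy(g)))
}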
Suppose you have your Chasing Dragons dataset. Your outcome variable is Return: a binary variable that captures whether or not the user returns next month, and you have tons of predictors. You can use the R library rpart and the function rpart, and the code would look like this:
# Classification Tree with rpart
library(rpart)

# grow tree
model1 <- rpart(Return ~ profile + num_dragons +
                num_friends_invited + gender + age + num_days,
                method = "class", data = chasingdragons)

printcp(model1)    # display the results
plotcp(model1)     # visualize cross-validation results
summary(model1)    # detailed summary of thresholds picked to
                   # transform to binary

# plot tree
plot(model1, uniform = TRUE,
     main = "Classification Tree for Chasing Dragons")
text(model1, use.n = TRUE, all = TRUE, cex = .8)
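Once the tree is grown (and perhaps pruned), you can score new users with it; a short usage sketch, assuming a data frame new_users with the same predictor columns as chasingdragons:

# Predict whether each new user will return next month.
predicted <- predict(model1, newdata = new_users, type = "class")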
Handling Continuous Variables in Decision Trees
Packages that already implement decision trees can handle continuous variables for you. So you can provide continuous features, and the package will determine an optimal threshold for turning each continuous variable into a binary predictor. But if you are building a decision tree algorithm yourself, then in the case of continuous variables, you need to determine the threshold at which to split each variable so that it can be treated as a binary one. So you could partition a user's number of dragon slays into "less than 10" and "at least 10," and you'd be getting back to the binary case.
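If you were implementing that step yourself, a minimal sketch (hypothetical helper names; it reuses the info_gain function sketched earlier) is to scan candidate cut points and keep the one that maximizes information gain:

# Pick the threshold t that maximizes the information gain of
# the binarized feature (x >= t) with respect to the outcome y.
best_threshold <- function(y, x) {
  candidates <- sort(unique(x))
  gains <- sapply(candidates, function(t) info_gain(y, x >= t))
  candidates[which.max(gains)]
}
# e.g., best_threshold(chasingdragons$Return, chasingdragons$num_dragons)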