If a dataset contains redundant variables, the resulting decision tree ignores all
but one of them, because the algorithm detects no additional information gain from
including more redundant variables. On the other hand, if the dataset contains
irrelevant variables and these variables happen to be chosen as splits, the tree may
grow too large, leaving less data at each subsequent split, where overfitting is likely
to occur. To address this problem, feature selection can be applied in the data
preprocessing phase to eliminate the irrelevant variables.
Although decision trees are able to handle correlated variables, they are not well
suited to datasets in which most of the variables are correlated, since overfitting
is likely to occur. To overcome the instability and potential overfitting of deep
trees, one can combine the decisions of several randomized shallow decision trees
(the basic idea of another classifier called random forest [4]) or use ensemble
methods to combine several weak learners for better classification. These methods
have been shown to improve predictive power over a single decision tree.
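As an illustration, the following sketch combines many randomized trees into a
random forest. It assumes the randomForest package is installed and a hypothetical
training data frame train_df whose factor column Play holds the class label; the
parameter values are examples, not recommendations.
library("randomForest")            # load the random forest package
set.seed(42)                       # make the randomized trees reproducible
fit <- randomForest(Play ~ .,      # grow an ensemble of randomized trees
                    data = train_df,
                    ntree = 500,   # number of trees in the forest
                    mtry = 2)      # variables tried at each split
print(fit)                         # shows the out-of-bag error estimate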
For binary decisions, a decision tree works better if the training dataset contains
records with an even probability of each outcome; in other words, the root of the
tree has a 50% chance of either classification. This can be achieved by randomly
selecting an equal number of training records from each class. Balancing the classes
counteracts the likelihood that the tree stops growing early because a purity test
is satisfied merely due to bias in the training data.
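One simple way to balance the classes is to sample an equal number of records from
each. The sketch below assumes a hypothetical data frame df with a factor column
Play taking the values "yes" and "no".
yes_rows <- df[df$Play == "yes", ]             # records of each class
no_rows  <- df[df$Play == "no", ]
n        <- min(nrow(yes_rows), nrow(no_rows)) # size of the smaller class
set.seed(42)                                   # reproducible sampling
balanced <- rbind(yes_rows[sample(nrow(yes_rows), n), ],
                  no_rows[sample(nrow(no_rows), n), ])
table(balanced$Play)                           # confirms an equal count per class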
When applying methods such as logistic regression to a dataset with many variables,
decision trees can help identify which variables are most useful, based on
information gain. Those variables can then be supplied to the logistic regression.
Decision trees can also be used to prune redundant variables.
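As a sketch of this screening step, the code below fits a tree with rpart, inspects
its variable importance ranking, and then fits a logistic regression on a subset of
variables. The data frame train_df, the label Play, and the retained variables
Outlook and Humidity are hypothetical names chosen for illustration.
library("rpart")
tree_fit <- rpart(Play ~ ., data = train_df, method = "class")
tree_fit$variable.importance       # variables ranked by their contribution
# keep, for example, the top-ranked variables for the logistic regression
logit_fit <- glm(Play ~ Outlook + Humidity,
                 data = train_df, family = binomial)
summary(logit_fit)                 # coefficients for the selected variables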
7.1.5 Decision Trees in R
In R, the rpart package is used for modeling decision trees, and the optional
package rpart.plot enables the plotting of a tree. The rest of this section shows
an example of how to use decision trees in R with rpart.plot to predict whether
to play golf given factors such as weather outlook, temperature, humidity, and wind.
In R, first set the working directory and initialize the packages.
setwd("c:/")
install.packages("rpart.plot") # install package rpart.plot
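To preview the kind of model built in the rest of this section, the following
sketch fits and plots a small classification tree with the packages loaded above.
The data frame golf_df and its columns are hypothetical stand-ins for the golf
dataset introduced below.
fit <- rpart(Play ~ Outlook + Temperature + Humidity + Wind,
             data = golf_df,
             method = "class",     # build a classification tree
             control = rpart.control(minsplit = 1))
rpart.plot(fit, type = 4, extra = 1)   # draw the fitted tree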