Each of the N training examples has the same probability, 1/N, of being drawn in each trial. The Bagging algorithm shown in Fig. 6.7 does exactly this: N times, the algorithm chooses a number r at random from 1 to N and adds the r-th training example to the bootstrap training set S. Clearly, some of the original training examples will not be selected for inclusion in the bootstrapped training set, while others will be chosen more than once. On average, each generated bootstrapped training set will contain about 0.63N unique training examples, even though it contains N training examples in total. In bagging, we create M such bootstrap training sets and then generate a classifier from each of them. Bagging returns a function h(x) that classifies new examples by returning the class y that receives the maximum number of votes from the base models {h_1, h_2, ..., h_M}.
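As a concrete illustration of this procedure, the sketch below builds a bagged ensemble and takes a majority vote over its predictions. It assumes scikit-learn decision trees as the base models and NumPy arrays with integer class labels as inputs (neither is prescribed by the text), and the function names are purely illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def build_bagging_ensemble(X, y, M=25, random_state=0):
        # Train M base models, each on its own bootstrap sample of the N examples.
        rng = np.random.default_rng(random_state)
        N = len(X)
        models = []
        for _ in range(M):
            idx = rng.integers(0, N, size=N)   # N draws with replacement (the r's)
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def predict_majority(models, X):
        # h(x): return the class that receives the most votes from h_1, ..., h_M.
        votes = np.stack([h.predict(X) for h in models]).astype(int)  # shape (M, n_samples)
        return np.array([np.bincount(col).argmax() for col in votes.T])

For any single bootstrap sample, len(np.unique(idx)) will typically come out close to 0.63N, matching the expected fraction of unique training examples noted above.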
In bagging, the M bootstrap training sets that are created are likely to
have some differences. If these differences are enough to induce noticeable
differences among the M base models while leaving their performances
reasonably good, then the ensemble will probably perform better than the
base models individually.
Random Forest:
Random forest is an ensemble of unpruned classification or regression trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process.
Prediction is made by aggregating (majority vote for classification or
averaging for regression) the predictions of the ensemble. Random forest
generally exhibits a substantial performance improvement over single-tree classifiers such as CART and C4.5. It yields a generalization error rate that compares favorably to that of AdaBoost, yet it is more robust to noise. However, like most classifiers, random forest can suffer when learning from an extremely imbalanced training data set: because it is constructed to minimize the overall error rate, it tends to focus on the prediction accuracy of the majority class, which often results in poor accuracy for the minority class.
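The construction just described can be sketched with an off-the-shelf implementation. The snippet below uses scikit-learn's RandomForestClassifier as a stand-in (the text does not name a specific library): each tree is grown unpruned on a bootstrap sample, max_features="sqrt" supplies the random feature selection at each split, and the predictions are aggregated across trees. The dataset is synthetic and purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Ensemble of unpruned trees on bootstrap samples with random feature selection.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    bootstrap=True, random_state=0).fit(X_tr, y_tr)
    # Single CART-style tree as a baseline for comparison.
    single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

    print("single tree accuracy:", single.score(X_te, y_te))
    print("random forest accuracy:", forest.score(X_te, y_te))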
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test error. Since each tree is constructed from a bootstrap sample, approximately one-third of the cases are left out of that sample and not used in training the corresponding tree. These cases are called out-of-bag (oob) cases, and they are used to obtain a running, unbiased estimate of the classification error as trees are added to the forest.
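A brief sketch of this oob estimate, again assuming scikit-learn as the implementation: with oob_score=True, each training example is scored only by the trees whose bootstrap samples did not contain it, so no separate test set is required. The dataset is again synthetic and illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    bootstrap=True, random_state=0).fit(X, y)
    # oob_score_ is the accuracy on out-of-bag cases; 1 - oob_score_
    # approximates the classification error without a held-out test set.
    print("oob accuracy:", forest.oob_score_)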
The error rate of a forest depends on the correlation between any two trees and the strength of each tree in the forest. Increasing the correlation between trees increases the forest error rate, while increasing the strength of the individual trees decreases it.