imbalance problem. We also study techniques for improving the classification accuracy
of class-imbalanced data. These are presented in Section 8.6.5.
8.6.1 Introducing Ensemble Methods
Bagging, boosting, and random forests are examples of ensemble methods (Figure 8.21).
An ensemble combines a series of k learned models (or base classifiers), M_1, M_2, ..., M_k,
with the aim of creating an improved composite classification model, M*. A given data
set, D, is used to create k training sets, D_1, D_2, ..., D_k, where D_i (1 ≤ i ≤ k) is used
to generate classifier M_i. Given a new data tuple to classify, the base classifiers each vote
by returning a class prediction. The ensemble returns a class prediction based on the
votes of the base classifiers.
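As a rough sketch of this general scheme (a Python illustration, not the book's code; it assumes each D_i is a bootstrap sample of D and uses decision trees as base classifiers, choices the text does not fix at this point):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, k=10, seed=0):
    """Train k base classifiers M_1, ..., M_k, each on its own training
    set D_i. Here each D_i is a bootstrap sample of D, an illustrative
    choice (bagging, Section 8.6.2, does exactly this; other ensemble
    methods construct the D_i differently)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))  # draw D_i from D
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def classify(models, x):
    """Each base classifier votes; the ensemble returns the majority class."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```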
An ensemble tends to be more accurate than its base classifiers. For example, consider
an ensemble that performs majority voting. That is, given a tuple X to classify, it
collects the class label predictions returned from the base classifiers and outputs the
class that receives the most votes. The base classifiers may make mistakes, but the
ensemble will misclassify X only if more than half of the base classifiers are in error.
Ensembles yield better results when there is significant diversity among the models,
that is, ideally, when there is little correlation among the classifiers. The classifiers
should also perform better than random guessing. Each base classifier can be allocated
to a different CPU, so ensemble methods are parallelizable.
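The claim that the ensemble errs only when a majority of its members err can be quantified with a short calculation (the error rate of 0.35 and the independence assumption below are illustrative, not from the text):

```python
from math import comb

def ensemble_error(k: int, p: float) -> float:
    """Error rate of a majority-voting ensemble of k independent base
    classifiers, each erring with probability p: the ensemble
    misclassifies only when more than half of the voters err."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# 25 base classifiers, each wrong 35% of the time (and assumed to err
# independently), yield an ensemble that errs only about 6% of the time.
print(ensemble_error(25, 0.35))  # ~0.06
```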
To help illustrate the power of an ensemble, consider a simple two-class problem
described by two attributes, x 1 and x 2 . The problem has a linear decision boundary.
Figure 8.22(a) shows the decision boundary of a decision tree classifier on the problem.
Figure 8.22(b) shows the decision boundary of an ensemble of decision tree classifiers
on the same problem. Although the ensemble's decision boundary is still piecewise
constant, it has a finer resolution and is better than that of a single tree.
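The contrast in Figure 8.22 can be reproduced with a few lines of scikit-learn (a sketch under assumed settings; a BaggingClassifier with 50 trees stands in for the figure's unspecified ensemble):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic two-class problem over attributes x1, x2 with the linear
# decision boundary x1 + x2 = 1.
X = rng.uniform(0, 1, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
ensemble = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
).fit(X_train, y_train)

# The bagged trees' piecewise-constant boundary tracks the diagonal
# more closely, which typically shows up as higher test accuracy.
print("single tree:", single.score(X_test, y_test))
print("ensemble:   ", ensemble.score(X_test, y_test))
```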
[Figure 8.21 schematic: a data set D yields training sets D_1, D_2, ..., D_k; each D_i trains a base classifier M_i; a new data tuple is passed to every M_i, and the votes are combined into a single prediction.]

Figure 8.21 Increasing classifier accuracy: Ensemble methods generate a set of classification models,
M_1, M_2, ..., M_k. Given a new data tuple to classify, each classifier “votes” for the class label
of that tuple. The ensemble combines the votes to return a class prediction.