imbalance problem. We also study techniques for improving the classification accuracy
of class-imbalanced data. These are presented in Section 8.6.5.
8.6.1 Introducing Ensemble Methods
Bagging, boosting, and random forests are examples of ensemble methods (Figure 8.21).
An ensemble combines a series of k learned models (or base classifiers), M_1, M_2, ..., M_k,
with the aim of creating an improved composite classification model, M*. A given data
set, D, is used to create k training sets, D_1, D_2, ..., D_k, where D_i (1 ≤ i ≤ k) is used
to generate classifier M_i. Given a new data tuple to classify, the base classifiers each vote
by returning a class prediction. The ensemble returns a class prediction based on the
votes of the base classifiers.
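As a rough sketch of this general scheme (a Python illustration, not the book's code; it assumes each D_i is a bootstrap sample of D and uses decision trees as base classifiers, choices the text does not fix at this point):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, k=10, seed=0):
    """Train k base classifiers M_1, ..., M_k, each on its own training
    set D_i. Here each D_i is a bootstrap sample of D, an illustrative
    choice (bagging, Section 8.6.2, does exactly this; other ensemble
    methods construct the D_i differently)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))  # draw D_i from D
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def classify(models, x):
    """Each base classifier votes; the ensemble returns the majority class."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```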
An ensemble tends to be more accurate than its base classifiers. For example, consider
an ensemble that performs majority voting. That is, given a tuple X to classify, it
collects the class label predictions returned from the base classifiers and outputs the
class that receives the most votes. The base classifiers may make mistakes, but the
ensemble will misclassify X only if more than half of the base classifiers are in error.
Ensembles yield better results when there is significant diversity among the models,
that is, ideally, when there is little correlation among the classifiers. The classifiers
should also perform better than random guessing. Each base classifier can be allocated
to a different CPU, so ensemble methods are parallelizable.
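The claim that the ensemble errs only when a majority of its members err can be quantified with a short calculation (the error rate of 0.35 and the independence assumption below are illustrative, not from the text):

```python
from math import comb

def ensemble_error(k: int, p: float) -> float:
    """Error rate of a majority-voting ensemble of k independent base
    classifiers, each erring with probability p: the ensemble
    misclassifies only when more than half of the voters err."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# 25 base classifiers, each wrong 35% of the time (and assumed to err
# independently), yield an ensemble that errs only about 6% of the time.
print(ensemble_error(25, 0.35))  # ~0.06
```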
To help illustrate the power of an ensemble, consider a simple two-class problem
described by two attributes, x 1 and x 2 . The problem has a linear decision boundary.
Figure 8.22(a) shows the decision boundary of a decision tree classifier on the problem.
Figure 8.22(b) shows the decision boundary of an ensemble of decision tree classifiers
on the same problem. Although the ensemble's decision boundary is still piecewise
constant, it has a finer resolution and is better than that of a single tree.
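The contrast in Figure 8.22 can be reproduced with a few lines of scikit-learn (a sketch under assumed settings; a BaggingClassifier with 50 trees stands in for the figure's unspecified ensemble):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic two-class problem over attributes x1, x2 with the linear
# decision boundary x1 + x2 = 1.
X = rng.uniform(0, 1, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
ensemble = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
).fit(X_train, y_train)

# The bagged trees' piecewise-constant boundary tracks the diagonal
# more closely, which typically shows up as higher test accuracy.
print("single tree:", single.score(X_test, y_test))
print("ensemble:   ", ensemble.score(X_test, y_test))
```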
[Figure 8.21 schematic: a data set D yields training sets D_1, D_2, ..., D_k; each D_i trains a base classifier M_i; a new data tuple is passed to every M_i, and the votes are combined into a single prediction.]

Figure 8.21 Increasing classifier accuracy: Ensemble methods generate a set of classification models,
M_1, M_2, ..., M_k. Given a new data tuple to classify, each classifier “votes” for the class label
of that tuple. The ensemble combines the votes to return a class prediction.