Algorithm: Bagging. The bagging algorithm—create an ensemble of classification models
for a learning scheme where each model gives an equally weighted prediction.
Input:
D, a set of d training tuples;
k, the number of models in the ensemble;
a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).
Output: The ensemble—a composite model, M*.
Method:
(1) for i = 1 to k do // create k models:
(2)   create bootstrap sample, D i, by sampling D with replacement;
(3)   use D i and the learning scheme to derive a model, M i;
(4) endfor
To use the ensemble to classify a tuple, X:
  let each of the k models classify X and return the majority vote;
Figure 8.23 Bagging.
robust to the effects of noisy data and overfitting. The increased accuracy occurs because
the composite model reduces the variance of the individual classifiers.
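As a concrete illustration (not from the text), the following Python sketch implements the procedure of Figure 8.23, assuming the training data are NumPy arrays and scikit-learn decision trees serve as the learning scheme; the helper names bagging_fit and bagging_predict are hypothetical.

# Sketch of the bagging procedure in Figure 8.23, assuming NumPy arrays
# X (tuples) and y (class labels), and decision trees as the base scheme.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k, random_state=0):
    """Learn k models, each from a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    d = len(X)
    models = []
    for _ in range(k):                      # (1) for i = 1 to k
        idx = rng.integers(0, d, size=d)    # (2) sample D with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # (3) derive M_i
    return models

def bagging_predict(models, x):
    """Classify tuple x by the majority vote of the k models."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

For example, bagging_predict(bagging_fit(X, y, k=10), X[0]) returns the majority-vote label for the first training tuple.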
8.6.3 Boosting and AdaBoost
We now look at the ensemble method of boosting. As in the previous section, suppose
that as a patient, you have certain symptoms. Instead of consulting one doctor, you
choose to consult several. Suppose you assign weights to the value or worth of each doc-
tor's diagnosis, based on the accuracies of previous diagnoses they have made. The final
diagnosis is then a combination of the weighted diagnoses. This is the essence behind
boosting.
In boosting , weights are also assigned to each training tuple. A series of k classifiers is
iteratively learned. After a classifier, M i , is learned, the weights are updated to allow the
subsequent classifier, M i+1, to “pay more attention” to the training tuples that were
misclassified by M i. The final boosted classifier, M*, combines the votes of each individual
classifier, where the weight of each classifier's vote is a function of its accuracy.
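To make the weight-update idea concrete, here is a minimal Python sketch of boosting in the AdaBoost style (described next), assuming binary class labels coded as +1/-1 and decision stumps as the individual classifiers; the function names and the particular update formulas shown are illustrative and need not match the book's pseudocode exactly.

# Sketch of boosting: tuple weights, sequential classifiers, weighted votes.
# Assumes NumPy arrays X and y, with labels in {+1, -1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k):
    d = len(X)
    w = np.full(d, 1.0 / d)                  # equal initial weight 1/d per tuple
    models, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # train with current tuple weights
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)            # weighted error of M_i
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))  # this model's vote weight
        w *= np.exp(-alpha * y * pred)       # raise weight of misclassified tuples
        w /= w.sum()                         # renormalize to a distribution
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, x):
    """Combine the weighted votes of the individual classifiers."""
    score = sum(a * m.predict(x.reshape(1, -1))[0] for a, m in zip(alphas, models))
    return 1 if score >= 0 else -1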
AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm. Suppose
we want to boost the accuracy of a learning method. We are given D , a data set of
d class-labeled tuples, (X 1, y 1), (X 2, y 2), ..., (X d, y d), where y i is the class label of
tuple X i. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating
k classifiers for the ensemble requires k rounds through the rest of the algorithm. In
round i, the tuples from D are sampled to form a training set, D i, of size d. Sampling