Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is "left out" at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.
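To make the stratification idea concrete, here is a minimal Python sketch; the function name stratified_folds and the round-robin dealing strategy are illustrative assumptions, not part of the text:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Partition tuple indices into k folds whose class distribution
    roughly matches that of the initial data (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's tuples round-robin, so every fold receives
        # approximately the same share of that class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# 70 "yes" tuples and 30 "no" tuples: each fold gets ~7 "yes" and ~3 "no",
# preserving the 70/30 class distribution of the initial data.
folds = stratified_folds(["yes"] * 70 + ["no"] * 30, k=10)
```

Each fold can then serve once as the test set while the remaining k − 1 folds form the training set, exactly as in ordinary k-fold cross-validation.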
8.5.4 Bootstrap
Unlike the accuracy estimation methods just mentioned, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence, the name, .632 bootstrap).
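The 63.2%/36.8% split is easy to observe empirically. The sketch below (the helper name bootstrap_sample is a hypothetical choice) draws one bootstrap sample from a data set of d tuples and measures what fraction of the distinct tuples were selected:

```python
import random

def bootstrap_sample(d, seed=0):
    """Draw d indices uniformly, with replacement, from a data set of d tuples."""
    rng = random.Random(seed)
    return [rng.randrange(d) for _ in range(d)]

d = 10_000
sample = bootstrap_sample(d)      # some indices appear more than once
distinct = set(sample)            # tuples that made it into the training set
frac_train = len(distinct) / d    # empirically close to 0.632
frac_test = 1 - frac_train        # left-out tuples form the test set (~0.368)
```

For large d, frac_train lands very close to 0.632 on virtually every run, matching the figure derived next in the text.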
"Where does the figure, 63.2%, come from?" Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, the probability approaches e^(−1) = 0.368.^7 Thus, 36.8% of the tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
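The limit (1 − 1/d)^d → e^(−1) can be confirmed numerically with a few lines of Python:

```python
import math

# Probability that a given tuple is never chosen in d draws with replacement:
# (1 - 1/d)^d, which tends to e^(-1) ≈ 0.368 as d grows.
for d in (10, 100, 1_000, 100_000):
    p_never = (1 - 1 / d) ** d
    print(f"d = {d:>6}: P(never chosen) = {p_never:.4f}")
print(f"limit e^(-1)  = {math.exp(-1):.4f}")
```

Already at d = 1,000 the probability agrees with e^(−1) to three decimal places.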
We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model, M, is then estimated as

    Acc(M) = Σ_{i=1}^{k} (1/k) × (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set),    (8.30)
where Acc(M_i)_test_set is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i. Acc(M_i)_train_set is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. Bootstrapping tends to be overly optimistic. It works best with small data sets.
7 e is the base of natural logarithms; that is, e = 2.718.
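The full procedure behind Eq. (8.30) can be sketched in Python as follows; the toy majority-class learner and the helper names are illustrative assumptions, not a prescribed implementation:

```python
import random

def acc(model, data):
    """Accuracy of `model` (a predict function) on a list of (x, y) tuples."""
    return sum(model(x) == y for x, y in data) / len(data)

def bootstrap_632(data, train_fn, k=50, seed=0):
    """Sketch of the .632 bootstrap accuracy estimate of Eq. (8.30)."""
    rng = random.Random(seed)
    d = len(data)
    total = 0.0
    for _ in range(k):
        picks = [rng.randrange(d) for _ in range(d)]   # sample d times, with replacement
        train = [data[i] for i in picks]               # bootstrap sample (training set)
        test = [data[i] for i in set(range(d)) - set(picks)]  # left-out tuples
        model = train_fn(train)
        # Test-set term uses the left-out tuples; train-set term applies the
        # model to the original set of data tuples, as in the text.
        total += 0.632 * acc(model, test) + 0.368 * acc(model, data)
    return total / k   # the 1/k average over the k iterations

def majority_class(train):
    """Toy learner: always predict the most frequent class in the training set."""
    ys = [y for _, y in train]
    top = max(set(ys), key=ys.count)
    return lambda x: top

# 30 tuples, two classes in a 2:1 ratio; a small data set suits bootstrapping.
data = [(i, "yes" if i % 3 else "no") for i in range(30)]
estimate = bootstrap_632(data, majority_class)
```

Any learner with the same train-then-predict shape can be substituted for majority_class; the estimator itself only needs the two accuracy terms of Eq. (8.30).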
 