Let us consider a classifier designed by empirical risk minimization using some $(X_{ds}, T_{ds})$ training set. We now denote by $n_d$ the number of $(X_{ds}, T_{ds})$ instances. The classifier error rate estimated by error counting in $(X_{ds}, T_{ds})$ (empirical error, formula (1.4)) is then denoted $P_{ed}(n_d)$. The generalization ability of the classifier is assessed by measuring its error rate in some $n_t$-sized test set, $(X_{ts}, T_{ts})$, assumed to be independently drawn from the same $F_{X,T}$ distribution; the respective error rate is $P_{et}(n_t)$. $P_{ed}(n_d)$ and $P_{et}(n_t)$ are single realizations of two r.v.s; we are then interested in the main statistics of these r.v.s, namely their expected values characterizing the average training-set and test-set behaviors of the classifier. Usually, there are no closed-form expressions of these expected values (an exception is the two-class setting with equal-variance Gaussian inputs; see [74]) and one has to use sample mean estimates, which we denote $\overline{P}_{ed}(n_d)$ and $\overline{P}_{et}(n_t)$, respectively.
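As a concrete illustration of these quantities, the sketch below computes single realizations of $P_{ed}(n_d)$ and $P_{et}(n_t)$ by error counting. The two-class Gaussian problem, the choice of scikit-learn's LinearDiscriminantAnalysis as classifier, and the set sizes are illustrative assumptions only, not prescribed by the text.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def draw_set(n):
    """Draw an n-sized sample from a two-class problem: X | T=t ~ N(2t, 1), equal priors."""
    t = rng.integers(0, 2, size=n)               # class labels T
    x = rng.normal(2.0 * t, 1.0).reshape(-1, 1)  # inputs X
    return x, t

n_d, n_t = 100, 10000
X_ds, T_ds = draw_set(n_d)                       # design (training) set
X_ts, T_ts = draw_set(n_t)                       # independent test set, same F_{X,T}

clf = LinearDiscriminantAnalysis().fit(X_ds, T_ds)

# Error counting: fraction of misclassified instances.
P_ed = np.mean(clf.predict(X_ds) != T_ds)        # training-set error rate, P_ed(n_d)
P_et = np.mean(clf.predict(X_ts) != T_ts)        # test-set error rate, P_et(n_t)
print(f"P_ed({n_d}) = {P_ed:.3f}   P_et({n_t}) = {P_et:.3f}")
```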
If, for some classifier designed with $n_d$ instances, one computes $\overline{P}_{et}(n_t)$ with a sufficiently large number of $(X_{ts}, T_{ts})$ sets s.t. $\overline{P}_{et}(n_t) \approx E[P_{et}(n_t)]$, then $\overline{P}_{et}(n_t)$ is also a close estimate of $P_e(n_d)$, the true error of the classifier (see e.g., [76]). On the other hand, for a sufficiently large number of $(X_{ds}, T_{ds})$ sets, $\overline{P}_{ed}(n_d) \approx E[P_{ed}(n_d)]$ will exhibit a bias, monotonically decreasing with $n_d$, relative to $\min P_e$, for consistent learning algorithms.
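This bias statement can be checked numerically. The sketch below reuses the same illustrative Gaussian problem (whose minimum error is $\min P_e = \Phi(-1) \approx 0.159$) and a linear discriminant classifier, averaging $P_{ed}(n_d)$ over many independently drawn design sets; none of these choices comes from the text.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

def draw_set(n):
    t = rng.integers(0, 2, size=n)               # X | T=t ~ N(2t, 1), equal priors
    return rng.normal(2.0 * t, 1.0).reshape(-1, 1), t

min_Pe = norm.cdf(-1.0)                          # minimum (Bayes) error of this problem
for n_d in (20, 50, 100, 500, 2000):
    errs = []
    for _ in range(500):                         # 500 independent (X_ds, T_ds) sets
        X_ds, T_ds = draw_set(n_d)
        if len(np.unique(T_ds)) < 2:             # skip the rare one-class draw
            continue
        clf = LinearDiscriminantAnalysis().fit(X_ds, T_ds)
        errs.append(np.mean(clf.predict(X_ds) != T_ds))
    # The magnitude of the (optimistic) bias shrinks as n_d grows.
    print(f"n_d={n_d:5d}  mean P_ed={np.mean(errs):.3f}  bias={np.mean(errs) - min_Pe:+.3f}")
```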
Unfortunately, in the usual practice one does not have the possibility of drawing large numbers of independent training sets and test sets. Instead, one is given an $n$-sized $(X_{dt}, T_{dt})$ set which will be used for both classifier design and classifier evaluation. One is then compelled to partition $(X_{dt}, T_{dt})$ into training sets and test sets. There are a number of ways of randomly selecting $n_d$ and $n_t$ instances for the computation of $P_{ed}(n_d)$ and $P_{et}(n_t)$. The basic scheme we use throughout the book is the stratified $k$-fold cross-validation scheme, where the $n$-sized $(X_{dt}, T_{dt})$ dataset is first randomly partitioned into $k$ partitions of size $n_t = n/k$, such that the classes maintain their proportional representation (stratification). One is then able to carry out $k$ design experiments, each time using all the instances of the $(k-1)$ other partitions as design set and testing the classifier on the remaining partition. Training-set and test-set averages $(\overline{P}_{ed}(n_d), \overline{P}_{et}(n_t))$ and standard deviations $(s(P_{ed}(n_d)), s(P_{et}(n_t)))$ of the $k$ experiments are then computed. The standard deviations always decrease with the set cardinalities (as in the binomial approximation $P_e(1 - P_e)/n$). Note that the partition scheme invalidates the assumption of test sets being independent of training sets. The practical implications of this, with respect to what we said above about $\overline{P}_{et}(n_t)$ and $\overline{P}_{ed}(n_d)$, are not important for not too small $n$.
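A minimal sketch of the stratified $k$-fold scheme just described, computing the fold averages and standard deviations, is given below. It assumes scikit-learn's StratifiedKFold and, purely for illustration, a linear discriminant classifier and a small synthetic dataset standing in for $(X_{dt}, T_{dt})$.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stratified_kfold_errors(X, t, k=10, seed=0):
    """Mean and standard deviation of training-set and test-set error rates over k folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    P_ed, P_et = [], []
    for design_idx, test_idx in skf.split(X, t):
        clf = LinearDiscriminantAnalysis().fit(X[design_idx], t[design_idx])
        P_ed.append(np.mean(clf.predict(X[design_idx]) != t[design_idx]))  # P_ed(n_d), n_d = n(k-1)/k
        P_et.append(np.mean(clf.predict(X[test_idx]) != t[test_idx]))      # P_et(n_t), n_t = n/k
    return (np.mean(P_ed), np.std(P_ed, ddof=1)), (np.mean(P_et), np.std(P_et, ddof=1))

# Illustrative dataset playing the role of the n-sized (X_dt, T_dt).
rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=200)
X = rng.normal(2.0 * t, 1.0).reshape(-1, 1)
(ed_m, ed_s), (et_m, et_s) = stratified_kfold_errors(X, t, k=10)
print(f"design: {ed_m:.3f} (s={ed_s:.3f})   test: {et_m:.3f} (s={et_s:.3f})")
```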
Sometimes, the $k$-fold cross-validation scheme is repeated a certain number of times, each time with a $(X_{ds}, T_{ds})$ reshuffling, thereby improving the quality (confidence interval) of the estimates.
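For this repeated variant, scikit-learn offers RepeatedStratifiedKFold, which reshuffles the data before each repetition of the $k$-fold partition; a brief sketch follows, again with an illustrative classifier and synthetic data not taken from the text.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=200)
X = rng.normal(2.0 * t, 1.0).reshape(-1, 1)

# 10-fold CV repeated 20 times; each repetition uses a fresh reshuffling of the data.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
P_et = [np.mean(LinearDiscriminantAnalysis().fit(X[d], t[d]).predict(X[ts]) != t[ts])
        for d, ts in rskf.split(X, t)]
print(f"P_et = {np.mean(P_et):.3f} (s = {np.std(P_et, ddof=1):.3f}) over {len(P_et)} folds")
```

Averaging over the 200 folds yields a tighter interval around the mean test error than a single 10-fold pass.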
The particular case of the cross-validation scheme with $k = 2$ is called the hold-out method. We use in the following examples a simpler version of the repeated hold-out method (simple hold-out) where half of the data is used
 