2.6.2.2 Cross-Validation
Cross-validation is a technique for estimating the generalization error of a
model from data that are not used for parameter estimation (training) [Stone
1974]. First, the set of available data is split into D disjoint subsets. Then,
the following steps are performed, for a family of functions having the same
complexity (e.g., neural networks with a given number of hidden neurons):
• iteration i, to be performed D times: build a training set with D − 1 subsets
of the available data; perform several trainings, with different initial values
of the parameters; for each model, compute the mean square error (VMSE)
on the validation set made of the N_V remaining examples,

$$E_V = \sqrt{\frac{1}{N_V}\sum_{k=1}^{N_V} (r_k)^2};$$

store in memory the smallest VMSE thus computed, E_{Vi};
• compute the cross-validation score from the D quantities E_{Vi} obtained at
the D iterations,

$$\sqrt{\frac{1}{D}\sum_{i=1}^{D} \left(E_{Vi}\right)^2}.$$
That score is an estimate of the generalization error for the family of functions
thus investigated.
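The procedure above can be sketched as follows (a minimal illustration: a simple least-squares straight-line fit stands in for the neural-network trainings, and, since that fit is deterministic, the repeated trainings with different initial parameter values are omitted; the function names are illustrative, not from the text):

```python
import numpy as np

def cross_validation_score(x, y, fit, predict, D=5):
    """D-fold cross-validation score as defined in the text:
    sqrt((1/D) * sum_i E_Vi^2), where E_Vi is the VMSE of fold i."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, D)             # D disjoint subsets
    vmse = []
    for i in range(D):
        val = folds[i]                         # the N_V remaining examples
        train = np.concatenate([folds[j] for j in range(D) if j != i])
        params = fit(x[train], y[train])       # "training" on D - 1 subsets
        r = y[val] - predict(params, x[val])   # residuals r_k
        vmse.append(np.sqrt(np.mean(r ** 2)))  # E_Vi = sqrt((1/N_V) sum r_k^2)
    return float(np.sqrt(np.mean(np.square(vmse))))  # sqrt((1/D) sum E_Vi^2)

# Example: a straight-line model fitted to noisy linear data.
fit = lambda x, y: np.polyfit(x, y, deg=1)
predict = lambda p, x: np.polyval(p, x)
x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + 0.1 * np.random.default_rng(1).normal(size=100)
score = cross_validation_score(x, y, fit, predict, D=5)
```

With D = 5 this performs the 5-fold procedure discussed next: each fold uses 80% of the data for training and 20% for validation.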
For instance, if D = 5 is chosen (a typical value; the process is then called
5-fold cross-validation), 5 different partitions of the database are constructed;
for each partition, 80% of the data are in the training set and 20% in the
validation set. As discussed above, the cross-validation score is the square
root of the average of the squared VMSE's computed on each partition. That
average must be performed because 20% of the database may not be a statistically
significant sample of the distribution of all possible examples. In a heuristic
fashion, the procedure may be simplified by performing a single partition of
the database, choosing a validation set that is as close as possible to the
distribution of the available examples. To that effect, one can estimate the
Kullback-Leibler divergence [Kullback et al. 1951; Kullback 1959] between
two probability distributions p_1 and p_2,

$$D(p_1, p_2) = \int_{-\infty}^{+\infty} p_1(x)\,\ln\frac{p_1(x)}{p_2(x)}\,dx.$$
Because the expression is not symmetrical, a more satisfactory distance is
defined as

$$\Delta = \frac{1}{2}\left[D(p_1, p_2) + D(p_2, p_1)\right].$$
Several random partitions of the database are performed, and the partition for
which the distance between the validation set and the training set is smallest
is selected.
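This partition-selection heuristic can be sketched as follows (a minimal illustration, assuming the distributions p_1 and p_2 are approximated by histograms of the training and validation sets; the bin count, the number of random partitions, and the small regularizing epsilon are illustrative choices, not from the text):

```python
import numpy as np

def kl_divergence(p1, p2, eps=1e-12):
    """Discrete approximation of D(p1, p2) = sum p1 * ln(p1 / p2);
    a small epsilon avoids division by zero in empty histogram bins."""
    p1 = np.asarray(p1, dtype=float) + eps
    p2 = np.asarray(p2, dtype=float) + eps
    p1 /= p1.sum()
    p2 /= p2.sum()
    return float(np.sum(p1 * np.log(p1 / p2)))

def symmetric_distance(p1, p2):
    """Delta = (1/2) [D(p1, p2) + D(p2, p1)]."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))

# Among several random 80/20 partitions, keep the one whose validation
# set is closest (in Delta) to the training set:
rng = np.random.default_rng(0)
data = rng.normal(size=500)
bins = np.linspace(data.min(), data.max(), 11)

best = None
for _ in range(20):                        # several random partitions
    idx = rng.permutation(len(data))
    train, val = data[idx[:400]], data[idx[400:]]
    h_train, _ = np.histogram(train, bins=bins)
    h_val, _ = np.histogram(val, bins=bins)
    delta = symmetric_distance(h_train, h_val)
    if best is None or delta < best[0]:
        best = (delta, train, val)
```

Since each KL term is nonnegative, Delta is nonnegative and vanishes only when the two histograms coincide, which is what makes it usable as a distance for ranking partitions.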