3.6.3 K-Fold Cross-Validation
K-fold cross-validation is the most popular sub-sampling technique. The concept of
K-fold cross-validation is not new and it is reported by Breiman et al. [ 10 ]. Based
on their detailed simulation studies on this concept, they concluded that these
methods do not always work. Even now, detailed research continues to find the
values of K for which K-fold cross-validation works best. Some research has shown
that the success in determining the K value is highly arbitrary and depends on the
experimental settings [12]. In K-fold cross-validation, the data are split into
K equal parts; one part is held out as the test set and the remaining K − 1 parts
form the training set. In the next experiment a different part is held out, and
this procedure is repeated K times, with the error estimated in each run. The
true error is then estimated from the predictions on the K test sets by
averaging the errors of the individual experiments. A pictorial representation of
K-fold cross-validation is given in Fig. 3.8 . If we use a large number of folds for
modeling, the bias of the true-error-rate estimator will be small, whereas its
variance will be large. Conversely, with a small number of folds the number of
experiments, and therefore the computation time, is reduced; the variance of the
estimator is small, but its bias is larger, so it tends to overestimate the true
error rate. The common practice for K-fold cross-validation is K = 10, which was
adopted in this thesis.
Fig. 3.8 Data splitting in K-fold cross-validation
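The procedure described above can be sketched in a few lines of code. The sketch below, a hypothetical illustration not taken from this thesis, splits the data into K folds, holds out each fold in turn, and averages the per-fold errors; the "model" is a deliberately trivial one (predict the training mean) so that only the cross-validation mechanics are on display.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k (nearly) equal contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k):
    """Estimate the true error by averaging the test error over k folds."""
    folds = kfold_indices(len(data), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th form the training set.
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        mean = sum(train) / len(train)  # "train" the trivial mean model
        test = [data[j] for j in test_idx]
        # Mean squared error on the held-out fold.
        mse = sum((x - mean) ** 2 for x in test) / len(test)
        errors.append(mse)
    # Average over the k experiments to estimate the true error.
    return sum(errors) / len(errors)
```

For instance, `cross_validate(measurements, 10)` would reproduce the K = 10 setting used in this thesis, running ten experiments and averaging their test errors. In practice the data are usually shuffled before splitting so that each fold is representative.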