Model Data Selection and Data Pre-processing Approaches - Hydrological Data Driven Modelling

However, its evaluation can have high variance because the samples may not be representative. The evaluation largely depends on which data points end up in the training set and which end up in the test set.
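The dependence on the split can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic `flows` series, the helper names `holdout_split` and `holdout_mae`, the 25% test fraction, and the deliberately trivial model that predicts the training mean. Only the seed that controls the split changes between runs.

```python
import random
from statistics import mean


def holdout_split(values, test_fraction, seed):
    """Shuffle a copy of the data and cut it once into training and test parts."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def holdout_mae(values, test_fraction, seed):
    """Mean absolute error of a mean-value predictor on the held-out part."""
    train, test = holdout_split(values, test_fraction, seed)
    prediction = mean(train)  # hypothetical stand-in for a real model
    return mean(abs(v - prediction) for v in test)


# Synthetic, skewed series standing in for observed flows (illustrative only).
data_rng = random.Random(3)
flows = [data_rng.lognormvariate(2.0, 0.8) for _ in range(40)]

# Same data, same model, ten different splits.
estimates = [holdout_mae(flows, 0.25, seed) for seed in range(10)]
print(f"lowest estimate:  {min(estimates):.2f}")
print(f"highest estimate: {max(estimates):.2f}")
```

The gap between the lowest and highest estimate comes only from which points fall into the test set, which is the variance described above.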

The Repeated Holdout Method is a modified version of the above-mentioned basic concept. It attempts to make holdout estimates more reliable by repeating the process with different resamplings of the data. This version commonly uses stratified sampling to ensure that each class is represented in approximately equal proportions in both subsets. The errors from the different iterations are averaged to yield an overall error rate. However, this version is not completely free from bias in the training and testing data sets. Another disadvantage is that the test sets of different iterations overlap.
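The procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the two-class "dry"/"wet" data are synthetic, the 1-D nearest-centroid classifier is a hypothetical stand-in for whatever model is being evaluated, and the function names, the 30% test fraction, and the 20 repeats are arbitrary choices.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_holdout(labels, test_fraction, rng):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def nearest_centroid_error(x, y, train_idx, test_idx):
    """Fit class centroids on the training rows; return the test error rate."""
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[y[i]].append(x[i])
    centroids = {label: mean(values) for label, values in by_class.items()}
    wrong = 0
    for i in test_idx:
        predicted = min(centroids, key=lambda label: abs(x[i] - centroids[label]))
        wrong += predicted != y[i]
    return wrong / len(test_idx)


def repeated_holdout(x, y, n_repeats=20, test_fraction=0.3, seed=0):
    """Average the holdout error over several independent stratified splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train_idx, test_idx = stratified_holdout(y, test_fraction, rng)
        errors.append(nearest_centroid_error(x, y, train_idx, test_idx))
    return mean(errors), errors


# Synthetic data: 60 "dry" and 40 "wet" observations (illustrative only).
data_rng = random.Random(42)
y = ["dry"] * 60 + ["wet"] * 40
x = [data_rng.gauss(0.0 if label == "dry" else 2.0, 1.0) for label in y]

average_error, per_repeat = repeated_holdout(x, y)
print(f"average error rate over {len(per_repeat)} repeats: {average_error:.3f}")
```

Because each repeat draws its test set independently, the same observation can be held out in several repeats, which is the overlap mentioned above.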

3.6.2 Random Sub-sampling

This is another widely used CVA. Random sub-sampling is also known in the literature as Monte Carlo cross-validation or repeated evaluation set [61]. In this approach, the whole data set is randomly split into subsets (as shown in Fig. 3.7), with the size of the subsets chosen arbitrarily by the user. Some research suggests that random sub-sampling is asymptotically consistent, giving more pessimistic predictions of the test data than conventional full cross-validation and more realistic estimates of the predictions for external validation data [66, 82].

Fig. 3.7 Data splitting in the random sub-sampling approach
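A minimal sketch of random sub-sampling, assuming a user-chosen test size of 20 out of 80 samples and 50 random splits. The rainfall and flow values are synthetic, and `monte_carlo_splits`, `fit_line`, and the straight-line model are hypothetical names and choices made only for this illustration.

```python
import random
from statistics import mean


def monte_carlo_splits(n_samples, n_splits, test_size, seed=0):
    """Yield (train_idx, test_idx) pairs, each drawn independently at random."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_splits):
        test_idx = rng.sample(indices, test_size)
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


# Synthetic rainfall and flow values (illustrative only).
data_rng = random.Random(7)
rain = [data_rng.uniform(0, 50) for _ in range(80)]
flow = [2.0 + 0.6 * r + data_rng.gauss(0, 3) for r in rain]

rmse_per_split = []
times_tested = [0] * len(rain)  # how often each sample landed in a test set
for train_idx, test_idx in monte_carlo_splits(len(rain), n_splits=50, test_size=20):
    intercept, slope = fit_line([rain[i] for i in train_idx],
                                [flow[i] for i in train_idx])
    squared = [(flow[i] - (intercept + slope * rain[i])) ** 2 for i in test_idx]
    rmse_per_split.append(mean(squared) ** 0.5)
    for i in test_idx:
        times_tested[i] += 1

print(f"mean RMSE over {len(rmse_per_split)} splits: {mean(rmse_per_split):.2f}")
print(f"a sample was tested between {min(times_tested)} and {max(times_tested)} times")
```

Unlike k-fold cross-validation, nothing forces every sample to be tested the same number of times: the split size and the number of splits are free choices, so some samples are held out far more often than others.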
