Fig. 2.2 K-fold process
[6, 26]. The value of k may vary, 5 and 10 being the most common choices. This value needs to be adjusted to avoid generating a small, poorly populated test partition whose scarcity of examples may bias the performance measures used. For large data sets, 10-FCV is usually employed, while for smaller data sets 5-FCV is more frequent.
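The k-fold partitioning of Fig. 2.2 can be sketched in plain Python; the helper below is an illustrative implementation (not from the text) that splits the indices 0..n-1 into k disjoint folds, each serving once as the test partition:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Distribute the remainder so fold sizes differ by at most one example.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in kfold_indices(10, 5):
    print(test)  # each example appears in exactly one test fold
```

In practice the data would be shuffled before indexing, so that the folds do not inherit any ordering present in the original file.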
Simple k-FCV may also disarrange the proportion of examples from each class in the test partition. The method most commonly employed in the literature to avoid this problem is stratified k-FCV. It places an equal number of samples of each class in each partition, so that the class distribution is the same in all partitions.
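Stratification can be sketched as follows; this is an illustrative helper (not from the text) that deals the indices of each class round-robin across the k folds, so every fold preserves the overall class proportions:

```python
from collections import defaultdict

def stratified_kfold_indices(labels, k):
    """Yield (train, test) index lists keeping class proportions per fold."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        # Deal each class's examples round-robin, one fold at a time.
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    for i in range(k):
        test = folds[i]
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, test
```

For example, with 6 examples of class "a" and 4 of class "b", stratified 2-FCV places 3 "a" and 2 "b" examples in each test fold, whereas a plain split by position could put all of class "b" into a single fold.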
Other popular validation schemes are:

• In 5×2 CV [22] the whole data set is randomly partitioned into two subsets A and B. The model is first built using A and validated with B; then the process is reversed, building the model with B and testing it with A. This partitioning process is repeated as many times as desired, aggregating the performance measure obtained at each step. Figure 2.3 illustrates the process. Stratified 5×2 cross-validation is the variation most commonly used in this scheme.
• Leave-one-out is an extreme case of k-FCV in which k equals the number of examples in the data set. At each step a single instance is used to test the model, while the rest of the instances are used to learn it.
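The two schemes above can be sketched in plain Python. The `build_and_score` callable is a hypothetical stand-in, assumed to train a model on its first argument and return a performance measure on its second:

```python
import random

def five_by_two_cv(data, build_and_score, seed=0):
    """5 iterations of 2-fold CV, yielding 10 scores in total."""
    rng = random.Random(seed)
    scores = []
    for _ in range(5):
        shuffled = data[:]
        rng.shuffle(shuffled)          # fresh random halves A and B each iteration
        half = len(shuffled) // 2
        a, b = shuffled[:half], shuffled[half:]
        scores.append(build_and_score(a, b))  # build with A, validate with B
        scores.append(build_and_score(b, a))  # build with B, test with A
    return scores

def leave_one_out(n):
    """k-FCV with k = n: each single instance is the test set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]
```

Leave-one-out is nearly unbiased but expensive, since it requires building n models; 5×2 CV keeps the cost fixed at ten model builds regardless of data set size.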
How to partition the data is a key issue, as it largely influences the performance of the methods and the conclusions extracted from that point on. A bad partitioning will surely yield incomplete and/or biased data about the behavior of the model being evaluated. This issue is actively investigated nowadays, with special attention paid to data set shift [21] as a decisive factor that imposes large k values in k-FCV to reach performance stability in the model being evaluated.
 