All Relevant Feature Selection Methods and Applications - Feature Selection for Data and Pattern Recognition - page 23


Table 2.4 Change of sensitivity as a function of the number of trees in Boruta

          Average for 4D systems                   Average for 5D systems
Ntree     TP     FN     Sensitivity (%)  PPV (%)   TP     FN     Sensitivity (%)  PPV (%)
100       -      -      -                -         40.2   164.8  19.6             100
200       -      -      -                -         70.2   134.8  34.2             100
500       34.9   21.1   57.8             99.7      123.6  81.4   60.3             100
1,000     39.9   16.0   63.5             99.5      157.6  47.4   76.9             100
2,000     43.4   12.5   68.5             98.9      180.4  24.6   88.0             99.8
5,000     -      -      -                -         189.6  15.4   92.5             99.5
10,000    -      -      -                -         193.2  11.8   94.2             99.4
20,000    -      -      -                -         195.6  9.4    95.4             99.4

The average results for all 4-dimensional systems examined with Boruta using 500, 1,000, and 2,000 trees are shown in the left panel. A more detailed inspection of the results for data sets described with 5 generative, 200 combination, and 1,000 total variables is presented in the right panel: average results over five instances for Boruta using 100 to 20,000 trees.

Table 2.5 Results of feature selection for two series of 4-dimensional data sets with a varying total number of variables in the system

Ncomb   Ntotal   Mean TP   Mean FP   Mean FN   Mean sensitivity (%)   Mean PPV (%)
50      100      54        0.0       0.0       100.0                  100.0
50      200      54        0.7       0.0       100.0                  98.8
50      500      54        0.0       0.0       100.0                  100.0
50      1,000    46.7      1.3       7.3       86.4                   97.2
50      2,000    52.0      0.3       2.0       96.3                   99.4
50      5,000    21.7      0.0       32.3      40.1                   100.0
50      10,000   10.7      0.0       43.3      19.8                   100.0
200     500      176.7     0.0       27.3      86.6                   100.0
200     1,000    173.3     0.0       30.7      85.0                   100.0
200     2,000    145.7     0.0       58.3      71.4                   100.0
200     5,000    112.7     0.0       91.3      55.2                   100.0
200     10,000   84.7      0.0       119.3     41.5                   100.0

The average numbers of true positives, false positives, and false negatives are displayed, together with the mean sensitivity and PPV. The averaging was performed over Random Forest models built from 500, 1,000, and 2,000 trees.
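The sensitivity and PPV columns follow the usual definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). As a minimal sketch, the two Ntotal = 10,000 rows of Table 2.5 can be recomputed from the mean counts; the function names are illustrative and not taken from the source.

```python
def sensitivity(tp: float, fn: float) -> float:
    """Percentage of truly relevant variables that were selected."""
    return 100.0 * tp / (tp + fn)


def ppv(tp: float, fp: float) -> float:
    """Percentage of selected variables that are truly relevant."""
    return 100.0 * tp / (tp + fp)


# Rows of Table 2.5 at Ntotal = 10,000: (Ncomb, mean TP, mean FP, mean FN)
rows = [(50, 10.7, 0.0, 43.3), (200, 84.7, 0.0, 119.3)]

for ncomb, tp, fp, fn in rows:
    print(ncomb, round(sensitivity(tp, fn), 1), round(ppv(tp, fp), 1))
```

For these rows the recomputed values agree with the table. In other rows, a ratio of mean counts can differ in the last digit from the tabulated figure, presumably because the table reports the mean of per-run values.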

increased, the sensitivity for the system with 54 relevant variables drops faster than for the system with 204 relevant variables, reaching 20% when the total number of variables reaches 10,000, whereas the sensitivity for the system with 204 relevant variables is still 40% at this point.
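A rough calculation hints at why the smaller system might be hit harder. Random forest considers only a random subset of the variables at each split, so one can ask how often that subset contains at least one relevant variable. The sketch below assumes a subset size of floor(sqrt(Ntotal)), a common default for classification forests that the source does not state, and it ignores the shadow variables that Boruta adds; the function name is hypothetical.

```python
from math import comb, isqrt


def prob_relevant_in_subset(n_total: int, n_relevant: int, mtry: int) -> float:
    """Chance that mtry variables drawn without replacement from n_total
    include at least one of the n_relevant variables (hypergeometric)."""
    none_relevant = comb(n_total - n_relevant, mtry) / comb(n_total, mtry)
    return 1.0 - none_relevant


n_total = 10_000
mtry = isqrt(n_total)  # assumed subset size, floor(sqrt(Ntotal))

for n_relevant in (54, 204):
    p = prob_relevant_in_subset(n_total, n_relevant, mtry)
    print(f"{n_relevant} relevant of {n_total}: P(subset has a relevant variable) = {p:.2f}")
```

Under these assumptions, a candidate subset is considerably less likely to contain any relevant variable in the 54-variable system than in the 204-variable one, which is consistent with the faster loss of sensitivity, although it is only an illustration and not the authors' analysis.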

This effect is most likely due to the method for generating splits in the random forest algorithm. A subset of variables is randomly selected from all variables, and the split is performed on the variable that produces the best split. When the number of relevant variables is large in comparison with the sample size, the variables with low
