Table 2.4 Change of sensitivity as a function of a number of trees in Boruta

          Average for 4D systems                        Average for 5D systems
Ntree     TP     FN     Sensitivity (%)   PPV (%)       TP      FN      Sensitivity (%)   PPV (%)
100       -      -      -                 -             40.2    164.8   19.6              100
200       -      -      -                 -             70.2    134.8   34.2              100
500       34.9   21.1   57.8              99.7          123.6   81.4    60.3              100
1,000     39.9   16.0   63.5              99.5          157.6   47.4    76.9              100
2,000     43.4   12.5   68.5              98.9          180.4   24.6    88.0              99.8
5,000     -      -      -                 -             189.6   15.4    92.5              99.5
10,000    -      -      -                 -             193.2   11.8    94.2              99.4
20,000    -      -      -                 -             195.6   9.4     95.4              99.4

The left panel shows the average results for all 4-dimensional systems examined with Boruta
using 500, 1,000, and 2,000 trees. The right panel gives a more detailed view of the results
for data sets with 5 generative, 200 combination, and 1,000 total variables; these are averages
over five instances, obtained with Boruta using 100 to 20,000 trees.

Table 2.5 Results of feature selection presented for two series of 4-dimensional data sets for varying total number of variables in the system

Ncomb   Ntotal    Mean TP   Mean FP   Mean FN   Mean sensitivity (%)   Mean PPV (%)
50      100       54        0.0       0.0       100.0                  100.0
        200       54        0.7       0.0       100.0                  98.8
        500       54        0.0       0.0       100.0                  100.0
        1,000     46.7      1.3       7.3       86.4                   97.2
        2,000     52.0      0.3       2.0       96.3                   99.4
        5,000     21.7      0.0       32.3      40.1                   100.0
        10,000    10.7      0.0       43.3      19.8                   100.0
200     500       176.7     0.0       27.3      86.6                   100.0
        1,000     173.3     0.0       30.7      85.0                   100.0
        2,000     145.7     0.0       58.3      71.4                   100.0
        5,000     112.7     0.0       91.3      55.2                   100.0
        10,000    84.7      0.0       119.3     41.5                   100.0

The mean numbers of true positives, false positives, and false negatives are displayed, together
with the mean sensitivity and PPV. The averaging was performed over Random Forest models built
from 500, 1,000, and 2,000 trees.
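
The sensitivity and PPV columns follow the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The short Python sketch below recomputes them from the mean counts of one row of Table 2.5. The function name is illustrative and not taken from the source. Because the table reports averages of per-run values, recomputing from the rounded mean counts may differ from the tabulated figures in the last digit for some rows.

```python
def sensitivity_and_ppv(tp, fp, fn):
    """Return (sensitivity %, PPV %) computed from TP, FP and FN counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    # PPV is undefined when no variable was selected at all; return None then.
    ppv = 100.0 * tp / (tp + fp) if (tp + fp) > 0 else None
    return sensitivity, ppv


# Mean counts from Table 2.5, row Ncomb = 50, Ntotal = 2,000.
sens, ppv = sensitivity_and_ppv(tp=52.0, fp=0.3, fn=2.0)
print(f"sensitivity = {sens:.1f}%, PPV = {ppv:.1f}%")
# prints: sensitivity = 96.3%, PPV = 99.4%
```
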
increased, the sensitivity for the system with 54 relevant variables drops faster than
for the system with 204 relevant variables, falling to about 20% when the total number
of variables reaches 10,000, whereas the sensitivity for the system with 204 relevant
variables is still about 40% at this point.
This effect is most likely due to the way splits are generated in the random
forest algorithm. A subset of variables is randomly selected from all variables, and
the split is performed on the variable from this subset that produces the best split.
When the number of relevant variables is large in comparison with the sample size, the variables with low
 