Nevertheless, in the case of the MicroMass data set, all attributes deemed tentative by Boruta using 500 trees were later confirmed by Boruta using 1,000 trees, without any false positive hits. This suggests that when the number of attributes deemed tentative is large, most of them are likely to be truly relevant, and a Boruta run with a larger number of trees is required to resolve them.
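For illustration, the following is a minimal sketch in R of such a two-stage analysis with the Boruta package; the data frame X and label vector y are hypothetical placeholders, and it is assumed that the ntree argument is forwarded to the underlying random forest:

    library(Boruta)

    # First pass with a 500-tree forest; noisy importance estimates
    # may leave some attributes tentative.
    set.seed(17)
    first <- Boruta(X, y, ntree = 500)

    # Second pass with a larger forest; the additional trees stabilise
    # the importance estimates, so attributes left tentative above are
    # usually resolved as confirmed or rejected.
    second <- Boruta(X, y, ntree = 1000)
    print(getSelectedAttributes(second, withTentative = FALSE))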
2.4 Conclusions
As demonstrated in this chapter, all-relevant feature selection algorithms are capable of discerning between relevant and irrelevant variables. The Boruta algorithm, used here as a representative of this class, was examined on a wide range of synthetic problems and several recently published real-world data sets. The algorithm works particularly well for systems for which good quality models can be obtained by means of the random forest classification algorithm; for such systems its sensitivity is close to 100%. The sensitivity of Boruta can be further improved by utilising a random forest with a larger number of decision trees. The level of false discoveries is very low for all data sets examined, therefore all-relevant feature selection is suitable for the generation of robust knowledge.
The main factor limiting analysis with the Boruta algorithm is computation time. A single run of the random forest algorithm can take several hours for larger systems, and in the best case Boruta requires time equivalent to at least 30 random forest runs to complete; hence an entire analysis may take more than one CPU-week. The random forest is computationally demanding, and its implementation in R, while very useful, is not very efficient for large problems. In particular, although the random forest algorithm is trivially parallel, its R implementation is strictly sequential. This limits application of the algorithm to the analysis of truly large data sets described with tens or even hundreds of thousands of variables and thousands of objects.
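One common workaround, sketched below in R, is to grow several partial forests in parallel and merge them into a single ensemble; the randomForest, foreach and doParallel packages are assumed, and X and y are again hypothetical placeholders:

    library(randomForest)
    library(foreach)
    library(doParallel)

    # Register one worker per core; each worker grows a partial forest.
    registerDoParallel(cores = 4)

    # Grow 4 x 250 = 1,000 trees in parallel and merge the partial
    # forests with combine(); the merged object can be used for
    # prediction like a single sequentially grown forest.
    rf <- foreach(nt = rep(250, 4),
                  .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
      randomForest(X, y, ntree = nt)

    print(rf)

Because the trees are grown independently, this approach scales almost linearly with the number of cores, although some aggregate statistics of the merged forest, such as the out-of-bag error rate, are not recomputed by combine().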
Acknowledgments Computations were partially performed at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland, grant G34-5. The authors would like to thank Mr. RafaƂ Niemiec for technical help.