Nevertheless, in the case of the MicroMass data set, all attributes deemed tentative by Boruta using 500 trees were later confirmed by Boruta using 1,000 trees, without any false positive hits. This suggests that when the number of attributes deemed tentative is large, most of them are likely to be truly relevant, and a Boruta run with a larger number of trees is required to resolve them.
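For illustration, the following is a minimal sketch in R of such a two-stage analysis with the Boruta package; the data frame X and label vector y are hypothetical placeholders, and it is assumed that the ntree argument is forwarded to the underlying random forest:

    library(Boruta)

    # First pass with a 500-tree forest; noisy importance estimates
    # may leave some attributes tentative.
    set.seed(17)
    first <- Boruta(X, y, ntree = 500)

    # Second pass with a larger forest; the additional trees stabilise
    # the importance estimates, so attributes left tentative above are
    # usually resolved as confirmed or rejected.
    second <- Boruta(X, y, ntree = 1000)
    print(getSelectedAttributes(second, withTentative = FALSE))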
2.4 Conclusions
As demonstrated in this chapter, all-relevant feature selection algorithms are capable of discerning between relevant and irrelevant variables. The Boruta algorithm, used here as a representative of this class, was examined on a wide range of synthetic problems and several recently published real-world data sets. The algorithm works particularly well for systems for which good quality models can be obtained by means of the random forest classification algorithm; for such systems its sensitivity is close to 100%. The sensitivity of Boruta can be further improved by utilising a random forest with a larger number of decision trees. The level of false discoveries is very low for all data sets examined, therefore all-relevant feature selection is suitable for the generation of robust knowledge.
The main factor limiting analysis with the Boruta algorithm is computation time. A single run of the random forest algorithm can take several hours for larger systems, and in the best case Boruta requires time equivalent to at least 30 random forest runs to complete; hence an entire analysis may take more than one CPU-week. The random forest is computationally demanding, and its implementation in R, while very useful, is not very efficient for large problems. In particular, although the random forest algorithm is trivially parallel, its R implementation is strictly sequential. This limits application of the algorithm to the analysis of truly large data sets described with tens or even hundreds of thousands of variables and thousands of objects.
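One common workaround, sketched below in R, is to grow several partial forests in parallel and merge them into a single ensemble; the randomForest, foreach and doParallel packages are assumed, and X and y are again hypothetical placeholders:

    library(randomForest)
    library(foreach)
    library(doParallel)

    # Register one worker per core; each worker grows a partial forest.
    registerDoParallel(cores = 4)

    # Grow 4 x 250 = 1,000 trees in parallel and merge the partial
    # forests with combine(); the merged object can be used for
    # prediction like a single sequentially grown forest.
    rf <- foreach(nt = rep(250, 4),
                  .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
      randomForest(X, y, ntree = nt)

    print(rf)

Because the trees are grown independently, this approach scales almost linearly with the number of cores, although some aggregate statistics of the merged forest, such as the out-of-bag error rate, are not recomputed by combine().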
Acknowledgments Computations were partially performed at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland, grant G34-5. The authors would like to thank Mr. RafaƂ Niemiec for technical help.