2.1.2 Algorithms for All-Relevant Feature Selection
There are two issues that do not arise for the minimal-optimal problem but are
very important for the all-relevant one. The first is the detection of weakly
relevant attributes that can be completely obscured by other attributes; the second
is distinguishing weakly but truly relevant variables from those that appear
relevant only due to random fluctuations.
The concepts of strong and weak relevance, and consequently the problem of
all-relevant feature selection, are defined in the context of a perfect classifier that is able
to use all available information. Yet in real-world applications one is restricted to
imperfect classification algorithms that cannot exploit all the information
present in the information system, and this may influence the outcome of the feature
selection algorithm. In particular, an algorithm may fail to find and use
some of the relevant features. In many cases this will not disturb the solution of the
minimal-optimal problem, provided that the final predictions of the classifier are sufficiently
accurate; yet it will significantly decrease the sensitivity of all-relevant feature
selection. Hence a classification algorithm used for all-relevant feature selection
should be able to detect weak and redundant attributes.
Algorithms that may be used for finding all the relevant features [3, 6, 9, 14, 17]
are designed around ensembles of decision trees, using either the random forest
algorithm [2] or an algorithm specially tailored for the task. The choice of decision
trees as base learners is due to their flexibility and relative robustness when multiple
redundant features are present in the data set. Moreover, an estimate of variable
importance is easily obtained for tree-based ensembles.
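
For illustration, the snippet below shows how a per-variable importance estimate can be read directly off a tree ensemble. It is a minimal sketch assuming scikit-learn's RandomForestClassifier and a synthetic data set; it is not part of any of the cited algorithms.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data set; sizes and names here are illustrative only.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=2, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Tree-based ensembles expose a per-variable importance estimate directly.
for idx, imp in enumerate(forest.feature_importances_):
    print(f"variable {idx}: importance = {imp:.4f}")
```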
The second issue, namely distinguishing truly relevant attributes from those that
are relevant only by chance, arises because the analysis is performed on finite samples. This gives
random correlations a chance to emerge and significantly influence the results.
The probability of such an event increases as the number of objects decreases; the
effect is also amplified by a large overall number of attributes, which additionally increases
the chance of random interactions between features. This issue is handled by introducing
'contrast variables', which are used as a reference: a statistical test is performed
that compares the importance of the original variables with that of the contrast variables.
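
The contrast-variable construction can be sketched as follows, again assuming scikit-learn's random forest importance. Here a simple comparison against the maximum contrast importance stands in for the formal statistical tests used in the cited algorithms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=2, random_state=0)

# Contrast variables: copies of the original columns with values permuted
# independently, so any importance they receive is due to chance alone.
X_contrast = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_contrast])

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_extended, y)

n = X.shape[1]
original_imp = forest.feature_importances_[:n]
contrast_imp = forest.feature_importances_[n:]

# Reference level derived from the contrast variables; original variables
# scoring above it are unlikely to look important merely by chance.
threshold = contrast_imp.max()
print("deemed relevant:", np.where(original_imp > threshold)[0])
```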
Contrast variables have been used to find all relevant variables by four independent
groups. Tuv et al. [17], in the ACE algorithm, used ensembles of shallow classification
trees and an iterative procedure in which the influence of the most important variables
on the decision was removed in order to reveal variables of secondary importance. In
each step, only those variables whose importance exceeded, in the statistical test, the
75th percentile of the contrast variables' importance were deemed important.
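
The acceptance criterion itself reduces to a percentile comparison. The snippet below is a self-contained illustration of that comparison with made-up importance values; the numbers are not taken from any of the cited studies.

```python
import numpy as np

# Hypothetical importance scores, for illustration only.
original_imp = np.array([0.12, 0.03, 0.07, 0.01])   # original variables
contrast_imp = np.array([0.02, 0.01, 0.03, 0.02])   # contrast variables

# ACE-style criterion: keep originals above the 75th percentile of contrasts.
threshold = np.percentile(contrast_imp, 75)
important = np.where(original_imp > threshold)[0]
print(f"threshold = {threshold:.3f}, variables deemed important: {important}")
```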
Rudnicki et al. [14] introduced the Boruta algorithm, which used the importance estimate
from the random forest. The algorithm started by establishing an initial ranking of
variables with a random forest, and then performed an iterative procedure in
which the least important variables were consecutively turned into contrast variables
by permuting their values between objects. The threshold level was then increased by
a predefined step and the procedure repeated until self-consistency was achieved.
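
A loose sketch in the spirit of this iterative procedure, again assuming scikit-learn, is given below; the step size and the stopping rule are simplifications for illustration rather than the exact Boruta recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, n_redundant=2, random_state=0)
n_features, step = X.shape[1], 1

def forest_importances(data, labels):
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(data, labels)
    return forest.feature_importances_

# Initial ranking of variables from the random forest (least important first).
ranking = np.argsort(forest_importances(X, y))
n_contrast, previous_relevant = step, None

while True:
    # Turn the n_contrast least important variables into contrast variables
    # by permuting their values between objects.
    contrast_idx = ranking[:n_contrast]
    X_iter = X.copy()
    for j in contrast_idx:
        X_iter[:, j] = rng.permutation(X_iter[:, j])

    imp = forest_importances(X_iter, y)
    reference = imp[contrast_idx].max()           # importance reachable by chance
    relevant = set(np.where(imp > reference)[0]) - set(contrast_idx)

    # Self-consistency: the relevant set no longer changes between iterations.
    if relevant == previous_relevant or n_contrast >= n_features:
        break
    previous_relevant = relevant
    n_contrast += step                            # raise the threshold by a step

print("relevant variables:", sorted(relevant))
```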
 