The procedure was carried out until the importance of all contrast variables was lower
than that of the unperturbed variables.
Huynh-Thu et al. [6] independently proposed a procedure that approaches the same goal from the opposite direction. It likewise starts by establishing the ranking of importance from RF. The algorithm then estimates the importance of noninformative variables by turning all variables into contrast variables. In the following steps the algorithm iteratively reintroduces the informative variables into the information system and computes the importance of both the i informative variables (the original most important variables) and the N − i noninformative variables.
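A minimal sketch of this iterative scheme, under our own assumptions, is shown below: a scikit-learn random forest and its feature_importances_ stand in for the importance measure, and the function name iterative_reintroduction is hypothetical; the original paper should be consulted for the exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iterative_reintroduction(X, y, n_trees=500, seed=0):
    """Rank variables with RF, permute all of them, then restore the
    top-ranked originals one by one, tracking the importance of the i
    restored variables against the N - i still-permuted ones."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]

    # Initial ranking of importance from a random forest.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # most important first

    # Turn every variable into a contrast variable by permuting its values.
    X_perm = np.array([rng.permutation(col) for col in X.T]).T

    history = []
    for i in range(1, N + 1):
        X_mix = X_perm.copy()
        restored = order[:i]
        X_mix[:, restored] = X[:, restored]   # reintroduce the top-i originals
        imp = RandomForestClassifier(n_estimators=n_trees,
                                     random_state=seed).fit(X_mix, y).feature_importances_
        rest = np.setdiff1d(np.arange(N), restored)
        history.append((i, imp[restored].mean(),
                        imp[rest].mean() if rest.size else 0.0))
    return history
```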
Dramiński et al. [4] introduced the MCFS algorithm to improve the feature ranking obtained from an ensemble of decision trees. It was constructed in such a way as to eliminate the known bias of random forest towards variables with a smaller number of values. The algorithm was later extended for use as an all-relevant feature selection algorithm [3] by introducing a comparison of the importance of variables with the maximal importance obtained from a set where all variables were uninformative.
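The reference threshold used by this all-relevant extension can be sketched as follows. This is an illustration under our assumptions, not the MCFS implementation: MCFS uses its own tree-ensemble importance measure, whereas here a scikit-learn random forest stands in, and the function name and the percentile cutoff are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def all_uninformative_threshold(X, y, n_repeats=30, n_trees=200, seed=0):
    """Permute every variable so none carries information about y, record the
    maximal importance such a dataset produces, and repeat to estimate the
    distribution of that maximum."""
    rng = np.random.default_rng(seed)
    max_imps = []
    for r in range(n_repeats):
        X_null = np.array([rng.permutation(col) for col in X.T]).T
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=r).fit(X_null, y)
        max_imps.append(rf.feature_importances_.max())
    # A variable is declared relevant when its importance on the original
    # data exceeds this threshold (the cutoff quantile is our choice).
    return np.quantile(max_imps, 0.99)
```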
The second version of Boruta [9] was introduced to improve computational efficiency and used a different heuristic procedure. In this version the original dataset is extended with random contrast variables. For each original attribute a 'shadow' attribute is generated by randomly permuting its values. Then, for each attribute, it is tested whether its importance is higher than the maximal importance achieved by a contrast attribute. In order to obtain statistically significant results this procedure is repeated several times, with contrast variables generated independently for each iteration. After each iteration, the algorithm checks how many times the importance of each tested attribute has been higher (or lower) than that of the highest ranked contrast variable. Once this count is significantly higher than expected under the hypothesis of equal importance with the highest random contrast, the attribute is deemed relevant and is not tested further. On the other hand, if this count is significantly lower than expected under the same hypothesis, the attribute is deemed irrelevant and permanently removed from the dataset. The corresponding contrast variables can be either retained or removed as well; the former choice increases the precision of the result, whereas the latter greatly improves computational efficiency. The algorithm terminates either when the relevance of all attributes has been established or when a predefined number of iterations has been performed. The result of the algorithm is the assignment of each variable to one of three classes: relevant, irrelevant, or unresolved (tentative). The final decision about the unresolved (tentative) attributes is left to the user.
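A minimal sketch of this shadow-attribute test is given below, under our own assumptions: a scikit-learn random forest supplies the importance scores, a pair of one-sided binomial tests at level alpha stands in for the package's statistical machinery, and rejected attributes are merely flagged rather than removed. The actual Boruta package differs in several details (importance measure, multiple-testing correction, removal of rejected attributes), so this is an illustration of the procedure, not its reference implementation.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.ensemble import RandomForestClassifier

def boruta_sketch(X, y, max_iter=100, alpha=0.05, n_trees=200, seed=0):
    """Illustrative shadow-attribute loop; not the Boruta package itself."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    status = np.zeros(N, dtype=int)  # 0 = tentative, 1 = relevant, -1 = irrelevant
    hits = np.zeros(N, dtype=int)    # how often an attribute beat every shadow

    for it in range(1, max_iter + 1):
        active = status == 0
        if not active.any():
            break                    # relevance of all attributes established
        # Extend the dataset with freshly permuted shadow copies of each attribute.
        shadows = np.array([rng.permutation(col) for col in X.T]).T
        imp = RandomForestClassifier(
            n_estimators=n_trees, random_state=it
        ).fit(np.hstack([X, shadows]), y).feature_importances_
        # An attribute scores a hit when it beats the best shadow this iteration.
        hits += (imp[:N] > imp[N:].max()) & active

        # One-sided binomial tests against p = 0.5 (chance level) decide status.
        for j in np.where(active)[0]:
            if binomtest(hits[j], it, 0.5, alternative='greater').pvalue < alpha:
                status[j] = 1        # significantly more hits: relevant
            elif binomtest(hits[j], it, 0.5, alternative='less').pvalue < alpha:
                status[j] = -1       # significantly fewer hits: irrelevant
                # For simplicity rejected attributes stay in X here; Boruta
                # removes them to speed up subsequent iterations.
    return status                    # 0s remaining at the end are 'tentative'
```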
All these algorithms are quite similar to one another: they are based on an ensemble of trees, use similar measures of importance, and rely on contrast variables to discern relevant from irrelevant attributes. They differ mostly in the implementation of the statistical test, as well as in performance. The current study is devoted to a detailed analysis of the performance of the Boruta algorithm on a family of synthetic datasets with a varying number of truly relevant variables and a varying total number of variables, and hence varying difficulty. The difficulty is measured as the error level of a random forest classifier built on the truly relevant variables only. Small-scale tests we performed have shown that the results of these algorithms are also similar; hence we believe that the analysis performed for a single algorithm will also be relevant for the other algorithms.
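The difficulty measure just described can be computed, for example, as follows; using the out-of-bag (OOB) error of a scikit-learn forest as the error estimate is our assumption, and the function name is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

def difficulty(X_relevant, y, n_trees=500, seed=0):
    """Error level of a random forest built only on the truly relevant
    variables, estimated here via the out-of-bag (OOB) score."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed).fit(X_relevant, y)
    return 1.0 - rf.oob_score_       # OOB classification error
```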