The procedure was carried out until the importance of all contrast variables was lower
than that of the unperturbed variables.
Huynh-Thu et al. [6] independently proposed a procedure that approaches the same goal from the opposite direction. It likewise starts by establishing the ranking of importance from RF. The algorithm then estimates the importance of noninformative variables by turning all variables into contrast variables. In the following steps the algorithm iteratively reintroduces the informative variables into the information system and computes the importance of both the i informative variables (the original most important variables) and the N − i noninformative variables.
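A minimal sketch of this iterative scheme, under our own assumptions, is shown below: a scikit-learn random forest and its feature_importances_ stand in for the importance measure, and the function name iterative_reintroduction is hypothetical; the original paper should be consulted for the exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iterative_reintroduction(X, y, n_trees=500, seed=0):
    """Rank variables with RF, permute all of them, then restore the
    top-ranked originals one by one, tracking the importance of the i
    restored variables against the N - i still-permuted ones."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]

    # Initial ranking of importance from a random forest.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # most important first

    # Turn every variable into a contrast variable by permuting its values.
    X_perm = np.array([rng.permutation(col) for col in X.T]).T

    history = []
    for i in range(1, N + 1):
        X_mix = X_perm.copy()
        restored = order[:i]
        X_mix[:, restored] = X[:, restored]   # reintroduce the top-i originals
        imp = RandomForestClassifier(n_estimators=n_trees,
                                     random_state=seed).fit(X_mix, y).feature_importances_
        rest = np.setdiff1d(np.arange(N), restored)
        history.append((i, imp[restored].mean(),
                        imp[rest].mean() if rest.size else 0.0))
    return history
```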
Dramiński et al. [4] introduced the MCFS algorithm to improve the feature ranking obtained from an ensemble of decision trees. It was constructed in such a way as to eliminate the known bias of random forest towards variables with a smaller number of values. The algorithm was later extended for use as an all-relevant feature selection algorithm [3] by introducing a comparison of the importance of variables with the maximal importance obtained from a set where all variables were uninformative.
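The reference threshold used by this all-relevant extension can be sketched as follows. This is an illustration under our assumptions, not the MCFS implementation: MCFS uses its own tree-ensemble importance measure, whereas here a scikit-learn random forest stands in, and the function name and the percentile cutoff are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def all_uninformative_threshold(X, y, n_repeats=30, n_trees=200, seed=0):
    """Permute every variable so none carries information about y, record the
    maximal importance such a dataset produces, and repeat to estimate the
    distribution of that maximum."""
    rng = np.random.default_rng(seed)
    max_imps = []
    for r in range(n_repeats):
        X_null = np.array([rng.permutation(col) for col in X.T]).T
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=r).fit(X_null, y)
        max_imps.append(rf.feature_importances_.max())
    # A variable is declared relevant when its importance on the original
    # data exceeds this threshold (the cutoff quantile is our choice).
    return np.quantile(max_imps, 0.99)
```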
The second version of Boruta [9] was introduced to improve computational efficiency and used a different heuristic procedure. In this version the original dataset is extended with random contrast variables. For each original attribute a 'shadow' attribute is generated by randomly permuting its values. Then, for each attribute, it is tested whether its importance is higher than the maximal importance achieved by a contrast attribute. In order to obtain statistically significant results this procedure is repeated several times, with contrast variables generated independently for each iteration. After each iteration, the algorithm checks how many times the importance of each tested attribute has been higher (or lower) than that of the highest ranked contrast variable. Once this count is significantly higher than expected under the hypothesis of equal importance with the highest random contrast, the attribute is deemed relevant and is not tested further. On the other hand, if this count is significantly lower than expected under the same hypothesis, the attribute is deemed irrelevant and permanently removed from the dataset. The corresponding contrast variables can be either retained or removed as well; the former choice increases the precision of the result, whereas the latter greatly improves computational efficiency. The algorithm terminates either when the relevance of all attributes has been established or when a predefined number of iterations has been performed. The result of the algorithm is the assignment of each variable to one of three classes: relevant, irrelevant, or unresolved (tentative). The final decision about the unresolved (tentative) attributes is left to the user.
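A minimal sketch of this shadow-attribute test is given below, under our own assumptions: a scikit-learn random forest supplies the importance scores, a pair of one-sided binomial tests at level alpha stands in for the package's statistical machinery, and rejected attributes are merely flagged rather than removed. The actual Boruta package differs in several details (importance measure, multiple-testing correction, removal of rejected attributes), so this is an illustration of the procedure, not its reference implementation.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.ensemble import RandomForestClassifier

def boruta_sketch(X, y, max_iter=100, alpha=0.05, n_trees=200, seed=0):
    """Illustrative shadow-attribute loop; not the Boruta package itself."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    status = np.zeros(N, dtype=int)  # 0 = tentative, 1 = relevant, -1 = irrelevant
    hits = np.zeros(N, dtype=int)    # how often an attribute beat every shadow

    for it in range(1, max_iter + 1):
        active = status == 0
        if not active.any():
            break                    # relevance of all attributes established
        # Extend the dataset with freshly permuted shadow copies of each attribute.
        shadows = np.array([rng.permutation(col) for col in X.T]).T
        imp = RandomForestClassifier(
            n_estimators=n_trees, random_state=it
        ).fit(np.hstack([X, shadows]), y).feature_importances_
        # An attribute scores a hit when it beats the best shadow this iteration.
        hits += (imp[:N] > imp[N:].max()) & active

        # One-sided binomial tests against p = 0.5 (chance level) decide status.
        for j in np.where(active)[0]:
            if binomtest(hits[j], it, 0.5, alternative='greater').pvalue < alpha:
                status[j] = 1        # significantly more hits: relevant
            elif binomtest(hits[j], it, 0.5, alternative='less').pvalue < alpha:
                status[j] = -1       # significantly fewer hits: irrelevant
                # For simplicity rejected attributes stay in X here; Boruta
                # removes them to speed up subsequent iterations.
    return status                    # 0s remaining at the end are 'tentative'
```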
All these algorithms are quite similar to one another: they are based on an ensemble of trees, use similar measures of importance, and rely on contrast variables to discern relevant from irrelevant attributes. They differ mostly in the implementation of the statistical test, as well as in performance. The current study is devoted to a detailed analysis of the performance of the Boruta algorithm on a family of synthetic datasets with a varying number of truly relevant variables and a varying total number of variables, and hence varying difficulty. The difficulty is measured as the error level of a random forest classifier built on the truly relevant variables only. Small-scale tests we performed have shown that the results of these algorithms are also similar; hence we believe that the analysis performed for a single algorithm will also be relevant for the other algorithms.
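The difficulty measure just described can be computed, for example, as follows; using the out-of-bag (OOB) error of a scikit-learn forest as the error estimate is our assumption, and the function name is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

def difficulty(X_relevant, y, n_trees=500, seed=0):
    """Error level of a random forest built only on the truly relevant
    variables, estimated here via the out-of-bag (OOB) score."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed).fit(X_relevant, y)
    return 1.0 - rf.oob_score_       # OOB classification error
```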