2.1.2 Algorithms for All-Relevant Feature Selection
There are two issues that do not arise for the minimal-optimal problem but are
very important for the all-relevant one. The first is the detection of weakly
relevant attributes that can be completely obscured by other attributes; the second
is distinguishing weakly but truly relevant variables from those that appear
relevant only due to random fluctuations.
The concepts of strong and weak relevance, and consequently the problem of
all-relevant feature selection, are defined in the context of a perfect classifier that is able
to use all available information. Yet in real-world applications one is restricted to
imperfect classification algorithms that cannot exploit all the information
present in the information system, and this may influence the outcome of the feature
selection algorithm. In particular, an algorithm may fail to find and use
some of the relevant features. In many cases this will not disturb the solution of the
minimal-optimal problem, provided that the final predictions of the classifier are sufficiently
accurate; yet it will significantly decrease the sensitivity of all-relevant feature
selection. Hence a classification algorithm used for all-relevant feature selection
should be able to detect weak and redundant attributes.
Algorithms that may be used for finding all the relevant features [3, 6, 9, 14, 17]
are designed around ensembles of decision trees, using either the random forest
algorithm [2] or an algorithm specially tailored for the task. The choice of decision
trees as base learners is due to their flexibility and relative robustness when multiple
redundant features are present in the data set. Moreover, an estimate of variable
importance is easily obtained for tree-based ensembles.
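
For illustration, the snippet below shows how a per-variable importance estimate can be read directly off a tree ensemble. It is a minimal sketch assuming scikit-learn's RandomForestClassifier and a synthetic data set; it is not part of any of the cited algorithms.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data set; sizes and names here are illustrative only.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=2, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Tree-based ensembles expose a per-variable importance estimate directly.
for idx, imp in enumerate(forest.feature_importances_):
    print(f"variable {idx}: importance = {imp:.4f}")
```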
The second issue, namely distinguishing truly relevant attributes from those that
are relevant only by chance, arises because the analysis is performed on finite samples. This gives
random correlations a chance to emerge and significantly influence the results.
The probability of such an event increases as the number of objects decreases; the
effect is also amplified by a large overall number of attributes, which additionally increases
the chance of random interactions between features. This issue is handled by introducing
'contrast variables', which are used as a reference: a statistical test is performed
that compares the importance of the original variables with that of the contrast variables.
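
The contrast-variable construction can be sketched as follows, again assuming scikit-learn's random forest importance. Here a simple comparison against the maximum contrast importance stands in for the formal statistical tests used in the cited algorithms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=2, random_state=0)

# Contrast variables: copies of the original columns with values permuted
# independently, so any importance they receive is due to chance alone.
X_contrast = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_contrast])

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_extended, y)

n = X.shape[1]
original_imp = forest.feature_importances_[:n]
contrast_imp = forest.feature_importances_[n:]

# Reference level derived from the contrast variables; original variables
# scoring above it are unlikely to look important merely by chance.
threshold = contrast_imp.max()
print("deemed relevant:", np.where(original_imp > threshold)[0])
```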
Contrast variables have been used to find all relevant variables by four independent
groups. Tuv et al. [17], in the ACE algorithm, used ensembles of shallow classification
trees and an iterative procedure in which the influence of the most important variables
on the decision was removed in order to reveal variables of secondary importance. In
each step, only those variables whose importance exceeded, in the statistical test, the
75th percentile of the contrast variables' importance were deemed important.
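
The acceptance criterion itself reduces to a percentile comparison. The snippet below is a self-contained illustration of that comparison with made-up importance values; the numbers are not taken from any of the cited studies.

```python
import numpy as np

# Hypothetical importance scores, for illustration only.
original_imp = np.array([0.12, 0.03, 0.07, 0.01])   # original variables
contrast_imp = np.array([0.02, 0.01, 0.03, 0.02])   # contrast variables

# ACE-style criterion: keep originals above the 75th percentile of contrasts.
threshold = np.percentile(contrast_imp, 75)
important = np.where(original_imp > threshold)[0]
print(f"threshold = {threshold:.3f}, variables deemed important: {important}")
```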
Rudnicki et al. [14] introduced the Boruta algorithm, which used the importance estimate
from the random forest. The algorithm started by establishing an initial ranking of
variables with a random forest, and then performed an iterative procedure in
which the least important variables were consecutively turned into contrast variables
by permuting their values between objects. The threshold level was then increased by
a predefined step and the procedure repeated until self-consistency was achieved.
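
A loose sketch in the spirit of this iterative procedure, again assuming scikit-learn, is given below; the step size and the stopping rule are simplifications for illustration rather than the exact Boruta recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, n_redundant=2, random_state=0)
n_features, step = X.shape[1], 1

def forest_importances(data, labels):
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(data, labels)
    return forest.feature_importances_

# Initial ranking of variables from the random forest (least important first).
ranking = np.argsort(forest_importances(X, y))
n_contrast, previous_relevant = step, None

while True:
    # Turn the n_contrast least important variables into contrast variables
    # by permuting their values between objects.
    contrast_idx = ranking[:n_contrast]
    X_iter = X.copy()
    for j in contrast_idx:
        X_iter[:, j] = rng.permutation(X_iter[:, j])

    imp = forest_importances(X_iter, y)
    reference = imp[contrast_idx].max()           # importance reachable by chance
    relevant = set(np.where(imp > reference)[0]) - set(contrast_idx)

    # Self-consistency: the relevant set no longer changes between iterations.
    if relevant == previous_relevant or n_contrast >= n_features:
        break
    previous_relevant = relevant
    n_contrast += step                            # raise the threshold by a step

print("relevant variables:", sorted(relevant))
```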
 