classification in the best settings, without additional noise from random features.
The random forest implementation in R [11] was used to perform the classification, using the default parameters: 500 trees in the ensemble, and the number of variables used for split generation set to the square root of the total number of variables.
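As an illustration, the sketch below reproduces these default settings with the randomForest package; the predictor matrix X and the class factor y are placeholders, not objects from the original study.

    library(randomForest)
    # Sketch: a random forest with the defaults described above.
    # X (predictor matrix) and y (factor of class labels) are placeholders.
    rf <- randomForest(x = X, y = y,
                       ntree = 500,                  # default ensemble size
                       mtry  = floor(sqrt(ncol(X)))) # default for classification
    rf$err.rate[rf$ntree, "OOB"]                     # OOB classification error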
2.2.3 Feature Selection
Two series of data sets were selected for further analysis with the Boruta feature selection algorithm, implemented as an R package [8]. In higher dimensions, both functions using deterministic class assignment generated data sets that were too difficult for the random forest algorithm; hence only the sets generated with the help of the two functions from the mlbench package that were based on randomized class assignment were used in the feature selection test.
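A minimal sketch of this step is given below; mlbench.threenorm is used only as a stand-in, since the two randomness-based generator functions are not named in this passage.

    library(mlbench)
    library(Boruta)
    # Sketch: Boruta feature selection on an mlbench-generated data set.
    # mlbench.threenorm is an assumed stand-in for the generators actually used.
    set.seed(1)
    d   <- mlbench.threenorm(n = 500, d = 5)  # 500 objects, 5 informative variables
    dat <- data.frame(d$x, Class = d$classes)
    res <- Boruta(Class ~ ., data = dat)
    print(res)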
The results of the classification testing showed that the quality of the models depends monotonically on the number of objects: the OOB classification error decreases with an increasing number of objects. This relationship was universal, but in some cases the number of objects required to obtain a model of good quality was very high; this was especially true for the high-dimensional problems. Therefore, to reduce the number of variable parameters, we fixed the number of objects at a single value of 500, which allowed us to scan a wide range of difficulties for numbers of variables varying between 50 and 10,000.
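The monotonic dependence can be illustrated with a short sketch that scans the number of objects (the generator choice is again an assumed stand-in):

    # Sketch: OOB error versus the number of objects.
    library(randomForest); library(mlbench)
    for (n in c(100, 200, 500, 1000)) {
      d  <- mlbench.threenorm(n = n, d = 5)   # assumed stand-in generator
      rf <- randomForest(d$x, d$classes)
      cat(n, "objects -> OOB error:", rf$err.rate[rf$ntree, "OOB"], "\n")
    }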
The tests were performed for the following grid of parameters describing the data sets:

      N_gen  = (2, 3, 4, 5)                                        generative variables
    × N_comb = (5, 10, 20, 50, 100, 200, 500)                      combination variables
    × N_all  = (50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000)    all variables,

where N_all = N_gen + N_comb + N_rand (N_rand is the number of random variables). Obviously, the grid points corresponding to a negative number of random variables were not explored. The number of variable parameters in the test is four; therefore, generating every possible combination is neither feasible nor interesting.
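In R, this grid can be sketched with expand.grid, dropping the infeasible points:

    # Sketch: the parameter grid described above; points with a negative
    # number of random variables are removed.
    grid <- expand.grid(Ngen  = c(2, 3, 4, 5),
                        Ncomb = c(5, 10, 20, 50, 100, 200, 500),
                        Nall  = c(50, 100, 200, 500, 1000, 2000, 5000, 10000))
    grid$Nrand <- grid$Nall - grid$Ngen - grid$Ncomb
    grid <- grid[grid$Nrand >= 0, ]   # drop points with negative N_rand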
Data sets that are either very easy or very difficult are not interesting for further analysis. For the purposes of this work, a data set was considered easy when the OOB estimate of the classification error of the random forest model was below 2%, and hard when the OOB error was above 30%. Therefore, only a subset of the possible data sets within the range of parameters was generated and tested. For each number of generative variables, an initial test system was generated that consisted of 500 objects with 5 combination features and 50 random features. Then the number of objects, combination features, and random features was varied until either the easy or the hard region of the parameter space was found.
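The easy/hard criterion can be expressed as a small helper function, a sketch of the thresholds quoted above:

    # Sketch: label a model by the OOB-error thresholds from the text.
    difficulty <- function(oob) {
      if (oob < 0.02) "easy"
      else if (oob > 0.30) "hard"
      else "interesting"
    }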
Additionally, the influence of the number of trees in the forest on the feature selection procedure was examined. To this end, the entire procedure was repeated using random forest classifiers obtained with three different numbers of trees, namely 500, 1,000, and 2,000.
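A sketch of this repetition is shown below; it assumes, as in the randomForest-based versions of Boruta, that the extra ntree argument is forwarded to the underlying forest, and it reuses the dat placeholder from the earlier sketch.

    # Sketch: repeat the selection for the three ensemble sizes examined.
    # Assumes ntree is forwarded to the underlying random forest.
    for (nt in c(500, 1000, 2000)) {
      res <- Boruta(Class ~ ., data = dat, ntree = nt)
      print(res)
    }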
Despite fixing the number of objects and examining only the two selected series of data sets, the number of possible combinations was still too high to be practical; hence not all of the possible grid points were examined. The set of combinations examined is given in Table 2.1.