classification in the best settings, without additional noise from random features.
The random forest implementation in R [11] was used to perform the classification, using the default parameters: 500 trees in the ensemble, and the number of variables used for split generation set to the square root of the total number of variables.
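As an illustration, the sketch below reproduces these default settings with the randomForest package; the predictor matrix X and the class factor y are placeholders, not objects from the original study.

    library(randomForest)
    # Sketch: a random forest with the defaults described above.
    # X (predictor matrix) and y (factor of class labels) are placeholders.
    rf <- randomForest(x = X, y = y,
                       ntree = 500,                  # default ensemble size
                       mtry  = floor(sqrt(ncol(X)))) # default for classification
    rf$err.rate[rf$ntree, "OOB"]                     # OOB classification error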
2.2.3 Feature Selection
Two series of data sets were selected for further analysis with the Boruta feature selection algorithm, implemented as an R package [8]. In higher dimensions, both functions using deterministic class assignment generated data sets that were too difficult for the random forest algorithm; hence only the sets generated with the help of the two functions from the mlbench package that were based on randomized class assignment were used in the feature selection test.
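A minimal sketch of this step is given below; mlbench.threenorm is used only as a stand-in, since the two randomness-based generator functions are not named in this passage.

    library(mlbench)
    library(Boruta)
    # Sketch: Boruta feature selection on an mlbench-generated data set.
    # mlbench.threenorm is an assumed stand-in for the generators actually used.
    set.seed(1)
    d   <- mlbench.threenorm(n = 500, d = 5)  # 500 objects, 5 informative variables
    dat <- data.frame(d$x, Class = d$classes)
    res <- Boruta(Class ~ ., data = dat)
    print(res)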
The results of the classification testing showed that the quality of the models depends monotonically on the number of objects: the OOB classification error decreases with an increasing number of objects. This relationship was universal, but in some cases the number of objects required to obtain a model of good quality was very high; this was especially true for the high-dimensional problems. Therefore, to reduce the number of variable parameters, we fixed the number of objects at a single value of 500, which allowed us to scan a wide range of difficulties for numbers of variables varying between 50 and 10,000.
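The monotonic dependence can be illustrated with a short sketch that scans the number of objects (the generator choice is again an assumed stand-in):

    # Sketch: OOB error versus the number of objects.
    library(randomForest); library(mlbench)
    for (n in c(100, 200, 500, 1000)) {
      d  <- mlbench.threenorm(n = n, d = 5)   # assumed stand-in generator
      rf <- randomForest(d$x, d$classes)
      cat(n, "objects -> OOB error:", rf$err.rate[rf$ntree, "OOB"], "\n")
    }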
The tests were performed for the following grid of parameters describing the data sets:

      N_gen  = (2, 3, 4, 5)                                        generative variables
    × N_comb = (5, 10, 20, 50, 100, 200, 500)                      combination variables
    × N_all  = (50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000)    all variables,

where N_all = N_gen + N_comb + N_rand (N_rand is the number of random variables). Obviously, the grid points corresponding to a negative number of random variables were not explored. The number of variable parameters in the test is four; therefore, generating every possible combination is neither feasible nor interesting.
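In R, this grid can be sketched with expand.grid, dropping the infeasible points:

    # Sketch: the parameter grid described above; points with a negative
    # number of random variables are removed.
    grid <- expand.grid(Ngen  = c(2, 3, 4, 5),
                        Ncomb = c(5, 10, 20, 50, 100, 200, 500),
                        Nall  = c(50, 100, 200, 500, 1000, 2000, 5000, 10000))
    grid$Nrand <- grid$Nall - grid$Ngen - grid$Ncomb
    grid <- grid[grid$Nrand >= 0, ]   # drop points with negative N_rand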
Data sets that are either very easy or very difficult are not interesting for further analysis. For the purposes of this work, a data set was considered easy when the OOB estimate of the classification error of the random forest model was below 2%, and hard when the OOB error was above 30%. Therefore, only a subset of the possible data sets within the range of parameters was generated and tested. For each number of generative variables, an initial test system was generated that consisted of 500 objects with 5 combination features and 50 random features. Then the number of objects, combination features, and random features was varied until either the easy or the hard region of the parameter space was found.
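The easy/hard criterion can be expressed as a small helper function, a sketch of the thresholds quoted above:

    # Sketch: label a model by the OOB-error thresholds from the text.
    difficulty <- function(oob) {
      if (oob < 0.02) "easy"
      else if (oob > 0.30) "hard"
      else "interesting"
    }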
Additionally, the influence of the number of trees in the forest on the feature selection procedure was examined. To this end, the entire procedure was repeated using random forest classifiers obtained with three different numbers of trees, namely 500, 1,000, and 2,000.
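A sketch of this repetition is shown below; it assumes, as in the randomForest-based versions of Boruta, that the extra ntree argument is forwarded to the underlying forest, and it reuses the dat placeholder from the earlier sketch.

    # Sketch: repeat the selection for the three ensemble sizes examined.
    # Assumes ntree is forwarded to the underlying random forest.
    for (nt in c(500, 1000, 2000)) {
      res <- Boruta(Class ~ ., data = dat, ntree = nt)
      print(res)
    }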
Despite fixing the number of objects and examining only the two selected series of data sets, the number of possible combinations was still too high to be practical; hence not all of the possible grid points were examined. The set of combinations examined is given in Table 2.1.