2.1.3 Random Forest
The random forest algorithm is used in the current work both as a classifier and as the engine of the feature selection algorithm, hence we give a short summary of its most important qualities below. It is designed as an ensemble of weak classifiers that combine their results during the final classification of each object. The individual classifiers are built as classification trees. Each tree is constructed on a different bootstrap sample of the training set, so roughly 1/3 of the objects are not used for building a given tree. At each step of the tree construction a different subset of attributes is randomly selected, and the split is performed on the attribute that leads to the best separation of the data between the nodes of the tree.
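Both randomization mechanisms are exposed as parameters of the R implementation: ntree sets the size of the ensemble and mtry the number of randomly selected attributes tried at each split. The following is a minimal sketch of growing such an ensemble; the iris data set, the seed and all parameter values are illustrative and not taken from this study.

library(randomForest)

set.seed(42)
# Each of the ntree trees is grown on its own bootstrap sample; at every
# split, mtry randomly chosen attributes compete for the best split.
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, mtry = 2, importance = TRUE)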
Consequently, each object is not used by roughly 1/3 of the trees. Such an object is called 'out of bag' (OOB) for these trees, and they are the OOB trees for this object. One may obtain the OOB error estimate by comparing, for each object, the classification by the ensemble of its OOB trees with the true decision. The OOB objects can also be used to estimate the importance of variables with the following procedure. For each tree, all its OOB objects are classified and the number of votes cast for the correct class is recorded. Then the values of the variable under scrutiny are randomly permuted across the objects, the classification is repeated, and the number of votes cast for the correct class is recorded again. The importance of the variable for a single tree is defined as the difference between the numbers of correct votes cast in the original and permuted systems, divided by the number of objects. The importance of the variable under scrutiny is then obtained by averaging these importance measures over the individual trees. The implementation of random forest in the R library [11] is used in Boruta and was also used for the classification tasks.
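Continuing the illustrative model above, the sketch below reads both estimates off the fitted randomForest object; type = 1 selects the permutation-based mean decrease in accuracy described in this section.

# OOB error of the whole ensemble: last row of the cumulative error matrix.
oob_error <- rf$err.rate[rf$ntree, "OOB"]

# Permutation importance averaged over trees (requires importance = TRUE).
imp <- importance(rf, type = 1)
print(imp[order(imp, decreasing = TRUE), , drop = FALSE])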
2.2 Testing Procedure
The Boruta algorithm is a wrapper around random forest, hence the quality of the feature selection is likely to depend on the quality of the random forest model. Therefore, in the first step of the testing procedure we performed a series of tests of the random forest algorithm itself on synthetic data sets. Then the performance of the all-relevant feature selection algorithm was examined on selected synthetic data sets as well as on a few real-world data sets.
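For reference, a minimal sketch of invoking the wrapper through the Boruta R package follows; the data set is again purely illustrative.

library(Boruta)

set.seed(42)
# Boruta grows random forests internally and iteratively compares the
# importance of real attributes with that of randomized shadow attributes.
bor <- Boruta(Species ~ ., data = iris)
print(getSelectedAttributes(bor))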
2.2.1 Data Sets
Synthetic data sets were constructed as variants of the well-known hypercube problem. In this problem a set of points is generated in the corners of a D-dimensional hypercube; each coordinate of a corner is either +1 or -1. The corners of the hypercube were assigned to one of two classes using two methods. The first one relies on a random process. The corners of the hypercube are numbered 1, ..., 2^D, then a random sample of length 2^(D-1) is drawn from the range (1, ..., 2^D), and the corners with these numbers are assigned to the first class, while the remaining corners form the second class.
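A minimal sketch of this construction in R follows; the dimension D and the seed are illustrative choices, not values used in the study.

set.seed(42)
D <- 5

# All 2^D corners of the D-dimensional hypercube, coordinates -1 or +1.
corners <- as.matrix(expand.grid(rep(list(c(-1, 1)), D)))

# Randomly draw 2^(D-1) corner indices; these corners form the first class,
# the remaining corners form the second one.
class1 <- sample(2^D, 2^(D - 1))
labels <- ifelse(seq_len(2^D) %in% class1, 1, 2)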