2.1.3 Random Forest
The random forest algorithm is used in the current work both as a classifier and as the engine of the feature selection algorithm, hence we give a short summary of its most important qualities below. It is designed as an ensemble of weak classifiers that combine their results during the final classification of each object. The individual classifiers are built as classification trees. Each tree is constructed on a different bootstrap sample of the training set, so roughly 1/3 of the objects are not used for building a given tree. At each step of the tree construction a different subset of attributes is randomly selected, and the split is performed on the attribute that leads to the best separation of the data between the nodes of the tree.
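Both randomization mechanisms are exposed as parameters of the R implementation: ntree sets the size of the ensemble and mtry the number of randomly selected attributes tried at each split. The following is a minimal sketch of growing such an ensemble; the iris data set, the seed and all parameter values are illustrative and not taken from this study.

library(randomForest)

set.seed(42)
# Each of the ntree trees is grown on its own bootstrap sample; at every
# split, mtry randomly chosen attributes compete for the best split.
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, mtry = 2, importance = TRUE)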
Consequently, each object is not used by roughly 1/3 of the trees. Such an object is called 'out of bag' (OOB) for these trees, and they are the OOB trees for this object. One may obtain the OOB error estimate by comparing, for each object, the classification by the ensemble of its OOB trees with the true decision. The OOB objects can also be used to estimate the importance of variables with the following procedure. For each tree, all its OOB objects are classified and the number of votes cast for the correct class is recorded. Then the values of the variable under scrutiny are randomly permuted across the objects, the classification is repeated, and the number of votes cast for the correct class is recorded again. The importance of the variable for a single tree is defined as the difference between the numbers of correct votes cast in the original and permuted systems, divided by the number of objects. The importance of the variable under scrutiny is then obtained by averaging these importance measures over the individual trees. The implementation of random forest in the R library [11] is used in Boruta and was also used for the classification tasks.
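Continuing the illustrative model above, the sketch below reads both estimates off the fitted randomForest object; type = 1 selects the permutation-based mean decrease in accuracy described in this section.

# OOB error of the whole ensemble: last row of the cumulative error matrix.
oob_error <- rf$err.rate[rf$ntree, "OOB"]

# Permutation importance averaged over trees (requires importance = TRUE).
imp <- importance(rf, type = 1)
print(imp[order(imp, decreasing = TRUE), , drop = FALSE])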
2.2 Testing Procedure
The Boruta algorithm is a wrapper around random forest, hence the quality of the feature selection is likely to depend on the quality of the random forest model. Therefore, in the first step of the testing procedure we performed a series of tests of the random forest algorithm itself on synthetic data sets. Then the performance of the all-relevant feature selection algorithm was examined on selected synthetic data sets as well as on a few real-world data sets.
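For reference, a minimal sketch of invoking the wrapper through the Boruta R package follows; the data set is again purely illustrative.

library(Boruta)

set.seed(42)
# Boruta grows random forests internally and iteratively compares the
# importance of real attributes with that of randomized shadow attributes.
bor <- Boruta(Species ~ ., data = iris)
print(getSelectedAttributes(bor))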
2.2.1 Data Sets
Synthetic data sets were constructed as variants of the well-known hypercube problem. In this problem a set of points is generated in the corners of a D-dimensional hypercube; each coordinate of a corner is either +1 or -1. The corners of the hypercube were assigned to one of two classes using two methods. The first one relies on a random process. The corners of the hypercube are numbered 1, ..., 2^D, then a random sample of length 2^(D-1) is drawn from the range (1, ..., 2^D), and the corners with these numbers are assigned to the first class, while the remaining corners form the second class.
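A minimal sketch of this construction in R follows; the dimension D and the seed are illustrative choices, not values used in the study.

set.seed(42)
D <- 5

# All 2^D corners of the D-dimensional hypercube, coordinates -1 or +1.
corners <- as.matrix(expand.grid(rep(list(c(-1, 1)), D)))

# Randomly draw 2^(D-1) corner indices; these corners form the first class,
# the remaining corners form the second one.
class1 <- sample(2^D, 2^(D - 1))
labels <- ifelse(seq_len(2^D) %in% class1, 1, 2)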