Table 1. NIPS 2003 feature selection challenge data

Dataset     Size (MB)   Type             Number of variables   Training examples   Validation examples
Arcene      8.7         Dense            10,000                100                 100
Gisette     22.5        Dense            5,000                 6,000               1,000
Dexter      0.9         Sparse integer   20,000                300                 300
Dorothea    4.7         Sparse binary    100,000               800                 350
Madelon     2.9         Dense            500                   2,000               600
Table 2. Comparison of no variable selection to variable selection

Data set   Variables   Error rate (all variables)   Selected variables   Error rate (selected variables)
Madelon    500         0.254                        19                   0.093
Dexter     20,000      0.324                        109                  0.074
Fig. 1. The importance of the top 33 out of 500 variables of Madelon, derived from a training set of 2,000 cases in 500 trees. Variable importance has a clear cut-off point at 19 variables.
5.1 Variable Selection Experiments
Initial experimentation was performed to determine whether variable selection was necessary at all. We trained ensembles of LSCs (ELSC) for two of the data sets. Results are given in Table 2 as the averages of tenfold cross-validation. These results clearly indicated that RLSC/ELSC is sensitive to noise variables in the data, and that variable selection based on importances derived from Random Forests works well.
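
As a rough illustration of this comparison, the sketch below estimates tenfold cross-validation error with and without a selected variable subset. It uses scikit-learn's RidgeClassifier as a stand-in for a single regularized least-squares classifier (the paper's ELSC is an ensemble of such classifiers), and the data arrays and selected-index list are hypothetical placeholders, not the challenge data.

    # Illustrative sketch only: RidgeClassifier stands in for a single
    # regularized least-squares classifier; X, y and `selected` below are
    # hypothetical placeholders.
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import cross_val_score

    def cv_error(X, y, n_folds=10):
        """Average tenfold cross-validation error rate of a ridge classifier."""
        clf = RidgeClassifier(alpha=1.0)
        accuracy = cross_val_score(clf, X, y, cv=n_folds, scoring="accuracy")
        return 1.0 - accuracy.mean()

    # err_all      = cv_error(X, y)                # all 500 Madelon variables
    # err_selected = cv_error(X[:, selected], y)   # the 19 selected variables
    # print(f"all: {err_all:.3f}   selected: {err_selected:.3f}")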
For the rest of the experiments, we adopted the following variable selection procedure. Variables are ranked by a random forest as described in Sect. 4. If there is a significant cut-off point in the ranked importances, the variable set before the cut-off point is selected. Figure 1 shows a clear example of such a cut-off point.
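
A minimal sketch of this procedure is given below. It assumes scikit-learn's RandomForestClassifier as the ranking forest and uses the largest relative drop between consecutive ranked importances as a simple proxy for the visual cut-off seen in Fig. 1; the authors' exact cut-off criterion is not reproduced here.

    # Rank variables with a random forest and keep those before the cut-off.
    # The cut-off heuristic (largest relative drop in sorted importance) is an
    # assumption for illustration, not the paper's exact criterion.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def select_before_cutoff(X, y, n_trees=500, random_state=0):
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
        forest.fit(X, y)
        order = np.argsort(forest.feature_importances_)[::-1]   # most important first
        ranked = forest.feature_importances_[order]
        drops = ranked[:-1] / np.maximum(ranked[1:], 1e-12)     # relative drop at each rank
        cutoff = int(np.argmax(drops)) + 1                      # rank just before the largest drop
        return order[:cutoff]                                    # indices of the kept variables

    # selected = select_before_cutoff(X_train, y_train)   # e.g. 19 variables on Madelon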
 