Table 2.1 Data sets generated for the all relevant feature selection analysis
# objects   # generative variables   # combination variables      Total variables   # Trees
500         (2, 3, 4, 5)             (5, 10, 20, 50)              100               (500, 1,000, 2,000)
500         (2, 3, 4, 5)             (5, 10, 20, 50, 100)         200               (500, 1,000, 2,000)
500         (2, 3, 4, 5)             (5, 10, 20, 50, 100, 200)    500               (500, 1,000, 2,000)
500         (2, 3, 4, 5)             (5, 10, 20, 50, 100, 200)    1,000             (500, 1,000, 2,000)
500         (2, 3, 4, 5)             (5, 10, 20, 50, 100, 200)    2,000             (500, 1,000, 2,000)
500         (2, 3, 4, 5)             (5, 10, 20, 50, 100, 200)    5,000             (500, 1,000, 2,000)
In contrast with the synthetic data sets, the true relevance of variables is unknown for real-world data sets. Therefore, we can directly measure neither the sensitivity nor the PPV of the algorithm. However, we can estimate the PPV using contrast variables, by measuring how many of them the algorithm deems relevant. To this end, we generate contrast variables as 'shadows' of the original variables, obtained by copying the values of the original variables and randomly permuting them between objects. Each variable is accompanied by one shadow variable. The system extended in this way is then analysed with the Boruta algorithm, and the PPV estimate is obtained as
\[
\mathrm{PPV} = \frac{N_{\mathrm{relevant}}(X_{\mathrm{original}})}{N_{\mathrm{relevant}}(X_{\mathrm{original}}) + N_{\mathrm{relevant}}(X_{\mathrm{contrast}})} , \tag{2.3}
\]
where PPV denotes the approximate PPV, and N_relevant(X_original) and N_relevant(X_contrast) are, respectively, the numbers of original and contrast variables that the algorithm has deemed relevant. The entire analysis was repeated five times to check the robustness of the results.
The Boruta algorithm assigns variables to three classes: Confirmed, Tentative and Rejected. One can treat the Tentative class as either relevant or irrelevant; hence two measures of PPV were used, PPV_c and PPV_t, which differ in the assignment of the Tentative variables: the former treats them as irrelevant, whereas the latter treats them as relevant.
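The shadow construction and the PPV estimate of Eq. (2.3) can be sketched as follows. This is a minimal illustrative Python/NumPy sketch, not the original implementation; the function names (add_shadows, ppv_estimates) and the textual decision labels are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_shadows(X):
    # X: (n_objects, n_variables) array of original variables.
    # Each shadow column copies an original column and permutes its values
    # between objects, destroying any relation to the class variable.
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    return np.hstack([X, shadows])

def ppv_estimates(decisions, n_original):
    # decisions: Boruta-style labels ('Confirmed', 'Tentative', 'Rejected'),
    # ordered so that the first n_original entries refer to the original
    # variables and the remaining ones to their contrast (shadow) copies.
    decisions = np.asarray(decisions)
    original, contrast = decisions[:n_original], decisions[n_original:]

    def ppv(relevant):
        n_orig = np.isin(original, relevant).sum()
        n_cont = np.isin(contrast, relevant).sum()
        total = n_orig + n_cont
        return n_orig / total if total else float("nan")  # Eq. (2.3)

    return {
        "PPV_c": ppv(["Confirmed"]),               # Tentative counted as irrelevant
        "PPV_t": ppv(["Confirmed", "Tentative"]),  # Tentative counted as relevant
    }
```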
2.3 Results and Discussion
Four series of data sets were generated using small variations of the same approach; nevertheless, the results differ significantly between these series. The two series of synthetic data sets generated with deterministic class assignment were generally difficult to classify with the random forest algorithm. The classification results for these sets were satisfactory (OOB error less than 30%) only for low-dimensional problems (dimensions 2 and 3). Problems of higher dimensionality were solvable only when a large number of objects was available. Therefore, further analysis of the synthetic sets was performed for the two remaining series only.
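The OOB-error criterion used above can be checked, for example, with a random forest that records its out-of-bag score. The snippet below is a hedged sketch using scikit-learn on an artificial data set built with make_classification; it is not the authors' generator or original setup, and the parameter choices merely echo the scale of Table 2.1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data set: 500 objects, a few informative variables plus noise
# (roughly mimicking the layout of Table 2.1, not the authors' generator).
X, y = make_classification(n_samples=500, n_features=100, n_informative=5,
                           n_redundant=0, random_state=0)

# oob_score=True makes the forest record its out-of-bag accuracy,
# so the OOB error referred to in the text is simply 1 - oob_score_.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
oob_error = 1.0 - rf.oob_score_
print(f"OOB error: {oob_error:.1%}")  # the text treats < 30% as satisfactory
```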