importance are rarely selected and hence their apparent importance is similar to that
of random variables. When the number of relevant variables is small, but not very
small, there is a good chance that one or two relevant variables will be selected
at each step. In this case the truly relevant variables have the highest chance of being
selected, and hence their apparent importance is high. Finally, when the number of
relevant variables is very small in comparison with the total number of variables, the
chance of a truly relevant variable being included in the sample is small. This again
decreases the apparent importance of relevant variables in comparison with random ones,
and hence decreases sensitivity.
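The last regime can be made concrete with a short calculation. The sketch below is an illustration only, under two assumptions that are not stated in the text above: that the candidate variables at each step are drawn uniformly without replacement, and that the sample size is the square root of the total number of variables (a common Random Forest default for classification). The function name and the example numbers are hypothetical.

```python
from math import comb, isqrt


def prob_at_least_one_relevant(n_total: int, n_relevant: int, n_sampled: int) -> float:
    """Probability that a uniform sample of n_sampled variables, drawn without
    replacement from n_total variables, contains at least one of the
    n_relevant relevant variables (hypergeometric complement)."""
    if not 0 <= n_relevant <= n_total:
        raise ValueError("n_relevant must lie between 0 and n_total")
    if not 0 <= n_sampled <= n_total:
        raise ValueError("n_sampled must lie between 0 and n_total")
    # P(no relevant variable in the sample) = C(N - k, m) / C(N, m);
    # math.comb returns 0 when m > N - k, which gives probability 1 below.
    p_none = comb(n_total - n_relevant, n_sampled) / comb(n_total, n_sampled)
    return 1.0 - p_none


if __name__ == "__main__":
    n_total = 10_000               # hypothetical total number of variables
    n_sampled = isqrt(n_total)     # assumed sample size per step: sqrt(N) = 100
    for n_relevant in (5, 50, 500):
        p = prob_at_least_one_relevant(n_total, n_relevant, n_sampled)
        print(f"{n_relevant:4d} relevant of {n_total}: P(at least one sampled) = {p:.3f}")
```

The calculation covers only the sampling part of the argument: it shows how the chance of a relevant variable entering the sample shrinks as the relevant fraction decreases. It does not model the competition between relevant variables described for the other regimes.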
A systematic survey of a range of synthetic data sets generated with varying
parameters shows that the results of the Boruta algorithm are robust. While the
sensitivity may be low for systems described with a very large number of variables,
the variables that are reported as relevant are relevant with very high probability.
2.3.2.1 Real-World Data Sets
The Boruta algorithm has also been applied to four real-world data sets recently
deposited in the UCI repository (see Table 2.6). In this case only the false discovery
ratio could be estimated, since the true relevance of the attributes is unknown. In the
two cases of sets described with a small number of attributes, nearly all attributes
were deemed relevant.
The level of false discovery was very low. In all cases PPV_c was 100%: not
a single false discovery was made under the strict definition of relevance. With the
more relaxed definition, which also counts Boruta's tentative class as relevant,
some false discoveries were reported for the QSAR biodegradation data set. Nevertheless,
even in this case the expected number of false discoveries was 0.4 and PPV_t was
98.9%. Therefore we may assume that nearly all features identified by Boruta as
relevant are truly so. The case of the QSAR biodegradation data set suggests that
variables assigned by Boruta to the tentative class may bear a higher risk of being
false positives.
Table 2.6 Results for the real-world data sets from the UCI repository

                                  Original                Contrast           PPV_c   PPV_t
Dataset   Instances  Variables    Conf   Tent   Rej      Conf  Tent  Rej       (%)     (%)
Q-b       1,055      41           36.2   0.8    4.0      0.0   0.4   40.6      100.0   98.9
TES       5,820      33           30.0   1.0    1.0      0.0   0.0   32.0      100.0   100.0
MM-500    931        1,300        293    66     941      0     0     1,300     100     100
MM-1000   931        1,300        363    58     879      0     0     1,300     100     100
ACRS      1,500      10,000       220    84     9,696    0.0   0.0   10,000.0  100.0   100.0

The MicroMass data set was analysed with Random Forest runs of 500 and 1,000 trees, which are
reported as MM-500 and MM-1000, respectively. The number of variables marked as confirmed
(Conf), tentative (Tent) and rejected (Rej) is reported for original and contrast variables. PPV_c
was computed according to Eq. 2.3, counting as relevant only those variables with confirmed status;
for PPV_t, variables with tentative status were also taken into account.
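
The PPV columns of Table 2.6 can be checked against the Conf and Tent counts. Eq. 2.3 is not reproduced in this section, so the sketch below assumes it has the usual form PPV = N_original / (N_original + N_contrast), where N counts the variables treated as relevant. Under that assumption the computed values agree with the table. The function and variable names are illustrative only.

```python
def ppv(original_relevant: float, contrast_relevant: float) -> float:
    """Positive predictive value in percent, assuming
    PPV = original / (original + contrast) (assumed form of Eq. 2.3)."""
    total = original_relevant + contrast_relevant
    if total == 0:
        raise ValueError("no variables were marked as relevant")
    return 100.0 * original_relevant / total


# (confirmed, tentative) counts for original and contrast variables, from Table 2.6
TABLE_2_6 = {
    "Q-b":     ((36.2, 0.8), (0.0, 0.4)),
    "TES":     ((30.0, 1.0), (0.0, 0.0)),
    "MM-500":  ((293, 66), (0, 0)),
    "MM-1000": ((363, 58), (0, 0)),
    "ACRS":    ((220, 84), (0.0, 0.0)),
}

if __name__ == "__main__":
    for name, ((orig_conf, orig_tent), (con_conf, con_tent)) in TABLE_2_6.items():
        # Strict definition: only confirmed variables count as relevant.
        ppv_c = ppv(orig_conf, con_conf)
        # Relaxed definition: confirmed and tentative variables both count.
        ppv_t = ppv(orig_conf + orig_tent, con_conf + con_tent)
        print(f"{name:8s} PPV_c = {ppv_c:5.1f}%  PPV_t = {ppv_t:5.1f}%")
```

For the Q-b row the relaxed definition gives 37.0 / 37.4, which rounds to the 98.9% reported in the table. Every other row has no contrast variables marked confirmed or tentative, so both values are 100%.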
 