Fig. 2.3 The OOB error for two series of HYPER data sets: one with points described by generative
variables only, and the other with additional combination features. The number of combination
features is twice the number of generative features. The data sets with combination features are
marked with a '*'.
One result that may be less intuitive is that the introduction of features that are linear
combinations of the original variables may improve the classification. In some cases these
features form a lower-dimensional subspace in which clusters originally located in the corners
of the hypercube can be separated. This effect is not universal, but it is observed for the last
series. Hence the presence of combination features in the data set may turn a problem that is
formally N-dimensional into an easier (N-k)-dimensional one. The effect is displayed in Fig. 2.3.
The classification error is significantly lower for the series in which the original generative
features are augmented with linear combinations. This result shows that the relationship between
importance and true relevance may not be straightforward.
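
The dimensionality reduction described above can be illustrated with a minimal toy sketch. It is not the HYPER data or the classifier used in the survey: the data, the labelling rule and the helper best_stump_accuracy are all assumptions made for illustration. Points are drawn uniformly in the unit square, the class depends only on x1 + x2, and a single-threshold split (the kind of axis-aligned cut a tree makes) is scored on each generative variable and on their linear combination.

```python
import random

random.seed(0)  # fixed seed so the toy data are reproducible

# Hypothetical toy data: 400 points uniform in the unit square.
# The class depends only on the combination x1 + x2 (an assumption
# made for illustration; this is not the HYPER data set).
points = [(random.random(), random.random()) for _ in range(400)]
labels = [int(x1 + x2 > 1.0) for x1, x2 in points]


def best_stump_accuracy(values, labels):
    """Best accuracy of a single threshold split on one feature."""
    n = len(labels)
    best = 0.0
    for t in values:  # every observed value is a candidate threshold
        hits = sum((v > t) == bool(y) for v, y in zip(values, labels))
        # A stump may assign either class to either side of the split.
        best = max(best, hits / n, 1 - hits / n)
    return best


# Generative variables on their own.
acc_x1 = best_stump_accuracy([p[0] for p in points], labels)
acc_x2 = best_stump_accuracy([p[1] for p in points], labels)
# Added combination feature: one axis-aligned cut is now enough.
acc_sum = best_stump_accuracy([p[0] + p[1] for p in points], labels)

print(f"x1 alone : {acc_x1:.2f}")
print(f"x2 alone : {acc_x2:.2f}")
print(f"x1 + x2  : {acc_sum:.2f}")
```

In this sketch the combination feature is separable by one threshold by construction, so the formally 2-dimensional problem becomes 1-dimensional, while each generative variable alone remains well short of perfect accuracy. Whether real data behave this way depends on the geometry of the clusters, as noted above.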
2.3.2 Feature Selection
In line with expectations, the results of feature selection are correlated with the results of
classification. Important features are difficult to identify in data sets that are difficult to
classify, and relatively easy to identify in those that are easy to classify. This is clearly
visible in Table 2.2, which collects the overall results of the survey of the synthetic data
sets. For the XOR series the sensitivity is very high for the easy 2-dimensional data sets and
drops to 25% for the hard 5-dimensional data sets. On the other hand, the level of false
discovery is uniformly low: the expected value of false positive discovery is 0.3. It means
that on average only 3 falsely relevant variables
 