5.5 Sequential Backward Selection
Backward elimination of variables starts with inducing a solution for all available (or all considered) features, which serves as the reference point. Then N subsets of variables are tested, each with a single variable rejected, and the performance of the resulting classifiers is compared. Ideally the predictive accuracy would increase; however, one goal of backward selection, the reduction of dimensionality, is achieved in any case, so as long as the performance does not deteriorate the result can still be considered satisfactory. From all tested subsets the one with the highest classification accuracy is selected, and the whole procedure is repeated for N − 1 attributes, then for N − 2, and so on. Backward selection procedures were employed for both connectionist and rule classifiers.
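A minimal sketch of this elimination loop is given below, assuming a hypothetical evaluate function that induces a classifier on a given feature subset and returns its predictive accuracy; the concrete inducer and scoring are not fixed by the procedure itself, so they are left abstract here.

def backward_selection(features, evaluate):
    # reference point: classifier induced on all N available features
    current = list(features)
    history = [(list(current), evaluate(current))]
    while len(current) > 1:
        # test N subsets, each with a single variable rejected
        candidates = [[f for f in current if f != removed]
                      for removed in current]
        scores = [evaluate(subset) for subset in candidates]
        # keep the subset with the highest classification accuracy
        best = max(range(len(candidates)), key=lambda i: scores[i])
        current = candidates[best]
        history.append((list(current), scores[best]))
    return history

Returning the whole history, rather than a single subset, matches the way the procedure is analysed here: the performance at every reduction step can then be inspected and plotted.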
As mentioned before, for ANNs backward elimination is better suited than forward selection. During the training phase every network naturally learns, to some degree, the relevance of particular input features, and this learned knowledge is expressed in the adjusted weights of the interconnections. It is also possible to exploit input pruning algorithms, which, however, involve rather complex calculations and processing, whereas the general steps of backward selection are straightforward. Relatively many networks with many inputs need to be tested, but in such cases the classifiers converge quickly and typically without trouble. When the number of inputs gets lower the training takes more time, but there are also significantly fewer such networks to be tested.
The details of the conducted experiments in which artificial neural networks were used as an inducer are given in Table 5.2, where the right-most (e) column presents the ordering that reflects the weights assigned to attributes by sequential backward selection. The elimination of “not” begins the list, while “but” was kept to the very end of the search, which indicates its high importance. The column specifying the classification accuracy displays the median performance: to minimise the influence of the initial interconnection weights on the results of the learning phase, the multi-start approach was employed, repeating the learning phase several times and accumulating the minimal, median, and maximal predictive accuracies, plotted in Fig. 5.3.
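The multi-start aggregation can be sketched as follows; train_and_test is a hypothetical function that trains one network from random initial weights on the given feature subset and returns its test accuracy, and the number of repetitions is an arbitrary placeholder, not a value taken from the experiments.

import statistics

def multi_start(train_and_test, subset, runs=10):
    # repeat the learning phase several times from different random
    # initial weights and accumulate the resulting accuracies
    accuracies = [train_and_test(subset) for _ in range(runs)]
    return (min(accuracies),
            statistics.median(accuracies),
            max(accuracies))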
It can be observed that in the initial 7 steps the performance increases with each reduced variable, stabilises at the level of 96.67% for the next 9 steps, and then decreases once the number of remaining features falls below 10. Yet when compared with the performance of the network for all 25 input variables, only in the last two steps, for classifiers referring to just two and one input respectively, are the obtained results worse.
The additional parameters of minimal and maximal performance can be used not only to achieve higher reliability of the obtained classification results, but also as factors helping in the selection of features. It may happen (and in this research it did happen many times) that at some elimination stage several subsets of features give the same results when only the median classification accuracy is compared. Then the results from each step of multi-starting can be analysed in more detail, checking the minimal and maximal accuracies as secondary criteria for choosing between the tied subsets.
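A sketch of this tie-breaking idea, under the assumption that each candidate subset is mapped to its (minimal, median, maximal) accuracy triple from multi-start training; the exact order of the secondary criteria is an assumption for illustration, not a rule stated in the text.

def pick_subset(results):
    # results: dict mapping a candidate feature subset (as a tuple)
    # to its (min, median, max) accuracy triple; compare the median
    # first, then the minimum and the maximum as tie-breakers
    return max(results, key=lambda s: (results[s][1],
                                       results[s][0],
                                       results[s][2]))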