5.5 Sequential Backward Selection
Backward elimination of variables starts with inducing a solution for all available (or all considered) features, which serves as the reference point. Then N subsets of variables are tested, each with a single variable rejected, and the performance of the resulting classifiers is compared. Ideally the predictive accuracy would increase; however, one goal of backward selection, the reduction of dimensionality, is achieved in any case, so as long as the performance does not deteriorate the result can still be considered satisfactory. From all tested subsets the one with the highest classification accuracy is selected, and the whole procedure is repeated for N − 1 attributes, then for N − 2, and so on. Backward selection procedures were employed for both connectionist and rule classifiers.
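A minimal sketch of this elimination loop is given below, assuming a hypothetical evaluate function that induces a classifier on a given feature subset and returns its predictive accuracy; the concrete inducer and scoring are not fixed by the procedure itself, so they are left abstract here.

def backward_selection(features, evaluate):
    # reference point: classifier induced on all N available features
    current = list(features)
    history = [(list(current), evaluate(current))]
    while len(current) > 1:
        # test N subsets, each with a single variable rejected
        candidates = [[f for f in current if f != removed]
                      for removed in current]
        scores = [evaluate(subset) for subset in candidates]
        # keep the subset with the highest classification accuracy
        best = max(range(len(candidates)), key=lambda i: scores[i])
        current = candidates[best]
        history.append((list(current), scores[best]))
    return history

Returning the whole history, rather than a single subset, matches the way the procedure is analysed here: the performance at every reduction step can then be inspected and plotted.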
As mentioned before, for ANNs backward elimination is better suited than forward selection. During the training phase every network naturally learns, to some degree, the relevance of particular input features, and this learned knowledge is expressed in the adjusted weights of the interconnections. It is also possible to exploit input pruning algorithms, which, however, involve rather complex calculations and processing, whereas the general steps of backward selection are straightforward. Relatively many networks with many inputs need to be tested, but in such cases the classifiers converge quickly and typically without trouble. When the number of inputs gets lower the training takes more time, but there are also significantly fewer such networks to be tested.
The details of the conducted experiments in which artificial neural networks were used as an inducer are given in Table 5.2, where the right-most (e) column presents the ordering that reflects the weights assigned to attributes by sequential backward selection. The elimination of “not” begins the list, while “but” was kept to the very end of the search, which indicates its high importance. The column specifying the classification accuracy displays the median performance: to minimise the influence of the initial interconnection weights on the results of the learning phase, the multi-start approach was employed, repeating the learning phase several times and accumulating the minimal, median, and maximal predictive accuracies, plotted in Fig. 5.3.
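The multi-start aggregation can be sketched as follows; train_and_test is a hypothetical function that trains one network from random initial weights on the given feature subset and returns its test accuracy, and the number of repetitions is an arbitrary placeholder, not a value taken from the experiments.

import statistics

def multi_start(train_and_test, subset, runs=10):
    # repeat the learning phase several times from different random
    # initial weights and accumulate the resulting accuracies
    accuracies = [train_and_test(subset) for _ in range(runs)]
    return (min(accuracies),
            statistics.median(accuracies),
            max(accuracies))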
It can be observed that in the initial 7 steps the performance increases with each reduced variable, stabilises at the level of 96.67% for the next 9 steps, and then decreases once the number of remaining features falls below 10. Yet when compared with the performance of the network for all 25 input variables, only in the last two steps, for classifiers referring to just two and one input respectively, are the obtained results worse.
The additional parameters of minimal and maximal performance can be used not only to achieve higher reliability of the obtained classification results, but also as factors helping in the selection of features. It may happen (and in this research it did happen many times) that at some elimination stage several subsets of features give the same results when only the median classification accuracy is compared. Then the results from each step of multi-starting can be analysed in more detail, checking the minimal and maximal accuracies as secondary criteria for choosing between the tied subsets.
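A sketch of this tie-breaking idea, under the assumption that each candidate subset is mapped to its (minimal, median, maximal) accuracy triple from multi-start training; the exact order of the secondary criteria is an assumption for illustration, not a rule stated in the text.

def pick_subset(results):
    # results: dict mapping a candidate feature subset (as a tuple)
    # to its (min, median, max) accuracy triple; compare the median
    # first, then the minimum and the maximum as tie-breakers
    return max(results, key=lambda s: (results[s][1],
                                       results[s][0],
                                       results[s][2]))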