For ANN classifiers it is more natural to apply backward elimination, because networks with an excessive number of inputs can still perform better than those with insufficient features. During training, networks can detect by themselves which input variables are less important and assign low weights to their interconnections, minimising their influence on the outcome. On the other hand, when there are too few characteristic features, the network can try to generalise, yet conclusions cannot be drawn from nothing. As a result, neural networks with only a few inputs typically need more time to be trained, and can have trouble converging and then generalising to unknown samples.
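The wrapper-style backward elimination described above can be sketched as follows; the scoring function is a hypothetical stand-in for training and testing a classifier on a given feature subset:

```python
def backward_elimination(features, evaluate):
    """Greedily drop the feature whose removal hurts the score least,
    stopping once every possible removal lowers the score."""
    current = list(features)
    best_score = evaluate(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for f in list(current):
            trial = [x for x in current if x != f]
            score = evaluate(trial)
            if score >= best_score:      # removal does not hurt performance
                best_score, current = score, trial
                improved = True
                break
    return current, best_score

# Toy score (illustrative only): features 'a' and 'c' matter,
# every extra feature adds a small noise penalty.
relevant = {'a', 'c'}
def toy_score(subset):
    return len(relevant & set(subset)) - 0.01 * len(set(subset) - relevant)

kept, score = backward_elimination(['a', 'b', 'c', 'd'], toy_score)
# kept == ['a', 'c']
```

In practice `evaluate` would wrap a full train-and-validate cycle of the chosen inducer, which is why backward elimination over many features is costly.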
Induction of decision rules takes significantly less time for fewer attributes. However, fewer attributes do not necessarily carry the information required to infer rules of good quality, yielding high predictive accuracy. By applying forward selection procedures we can not only choose the attributes that are most beneficial to the rule induction process, but also adjust their preference orders, which further increases performance. Typically, minimal cover decision algorithms give worse results than rule classifiers constructed with other approaches, for example all rules on examples with some hard constraints imposed on the constituent rules, such as a minimal required support. Yet inferring all rules when there are many attributes requires a lot of computation and takes time. Since in subsequent stages of backward elimination many of the generated rules would be the same, as the studied subsets of features overlap, we can employ another methodology, in which backward reduction is in fact applied to the rules referring to the rejected features.
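The shortcut just described can be illustrated with a small sketch: the full rule set is induced once, and whenever a feature is rejected during backward reduction, every rule referring to it is simply discarded instead of re-running induction. The rule representation and the rules themselves are hypothetical:

```python
# A rule is (conditions, decision), with conditions mapping
# attribute name -> required value.
all_rules = [
    ({'a': 1, 'b': 0}, 'yes'),
    ({'c': 1},         'yes'),
    ({'b': 1, 'c': 0}, 'no'),
    ({'d': 0},         'no'),
]

def reject_feature(rules, feature):
    """Keep only rules whose conditions never mention the rejected feature."""
    return [r for r in rules if feature not in r[0]]

after_b = reject_feature(all_rules, 'b')
# Only the rules on 'c' and 'd' survive the rejection of 'b'.
```

This trades a single expensive all-rules induction for cheap set filtering at every elimination step, which is exactly where the savings over repeated induction come from.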
For all search paths tested, one important element to consider is the stopping criterion, which answers the question of when or where the selection procedure should end. The answer is not trivial, as it depends to a high degree on the purpose of applying the search procedure in the first place. When the goal is simply to find a good subset of features, that is, one resulting in an induced solution with satisfyingly high predictive accuracy, we can stop the search once we detect a maximum in the correct recognition ratio. However, if we stop too quickly, before checking alternative subsets, it may turn out that the maximum is only local rather than global, and that for some other candidate subset of variables the predictive accuracy is better.
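One simple way to guard against stopping at a merely local maximum is a patience-based criterion: the search continues through a fixed number of non-improving steps before terminating. A minimal sketch, assuming the candidate subsets along the search path have already been scored:

```python
def select_with_patience(scores, patience=2):
    """Scan candidate-subset scores in search order; stop only after
    `patience` consecutive non-improving steps, so a single dip in
    accuracy does not end the search prematurely."""
    best_idx, best, waited = 0, scores[0], 0
    for i, s in enumerate(scores[1:], start=1):
        if s > best:
            best_idx, best, waited = i, s, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx, best

# Accuracy dips at step 2 but recovers at step 3: stopping at the
# first non-improvement would have settled for 0.82.
idx, acc = select_with_patience([0.80, 0.82, 0.79, 0.86, 0.85])
```

Larger patience values check more alternative subsets at the cost of extra classifier evaluations; exhaustive testing of the whole path is the limiting case.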
If extended processing is acceptable, or when the goal is to weight the available variables, we test all possible subsets of variables in a search path. We observe the performance (after all, the choice is conditioned by it), but we also study the order in which the features are organised. This order reflects their weighting from the perspective of the applied search procedure and the inducer employed. As classifiers have different characteristics and the selection of variables is wrapped around their performance, the same search direction applied to another classification system with distinctly different properties may return a completely different ranking of attributes. From all validated subsets we can choose the best, or we can impose the obtained ranking of features on a separate classification process and test its usefulness as a filter.
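Turning a wrapper search into such a ranking can be sketched as follows; the elimination order is hypothetical (e.g. the order in which backward elimination discarded attributes), and a second classifier then simply keeps the top-ranked attributes without running its own search:

```python
def ranking_from_elimination(all_features, elimination_order):
    """Features eliminated later (or never) rank higher: survivors of
    the search come first, followed by the discarded features in
    reverse order of their elimination."""
    survivors = [f for f in all_features if f not in elimination_order]
    return survivors + list(reversed(elimination_order))

# 'b' was discarded first, then 'd'; 'a' and 'c' survived the search.
rank = ranking_from_elimination(['a', 'b', 'c', 'd'], ['b', 'd'])
top_2 = rank[:2]   # filter: the other classifier uses only these
```

Used this way the wrapper-derived order acts as a filter for the second classifier, which is cheap to apply but, as noted above, may suit that classifier poorly if its characteristics differ markedly from those of the inducer that produced the ranking.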
All attribute selection procedures were illustrated for a binary classification task with balanced data, namely the problem of authorship attribution based on stylometric processing of texts. The most important aim of textual analysis is to find definitions