attribute another one was added, and 24 two-attribute subsets were prepared, with rules generated for each of them. Selection of the best algorithm again concluded the processing at this stage. The analogous procedure was executed at all subsequent stages.
At the first and all subsequent stages, four decision algorithms were generated for each considered subset: two with the conditional attributes of cost (decreasing) type and two of gain (increasing) type, for both the minimal cover and the all rules on examples algorithm. All possible and approximate rules were then excluded from the inferred algorithms, and each algorithm was tested with respect to its maximal predictive accuracy. To this end, hard constraints were imposed on the rules of all algorithms, requiring a minimal support for a rule to be taken into consideration in classification. In most cases these requirements resulted in increased performance. The details of the conducted experiments are listed in Table 5.1.
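The pruning step described above can be illustrated with a minimal sketch. The rule representation and the names used here (Rule, prune_rules, min_support, certain) are assumptions introduced for illustration only, not the representation used in the original experiments.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict      # attribute -> required value
    decision: str         # decision class indicated by the rule
    support: int          # number of supporting training examples
    certain: bool = True  # False for possible/approximate rules

def prune_rules(rules, min_support):
    """Exclude possible and approximate rules, then apply the hard
    constraint on minimal support: only rules supported by at least
    min_support training examples are kept for classification."""
    return [r for r in rules if r.certain and r.support >= min_support]
```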
As ambiguous cases, that is, no rule matching a testing sample or matching rules with contradicting decisions, were always treated as incorrect classifications, the performance of these rule classifiers in the initial phase, when only a few features are considered, is rather poor. However, it increases quickly with each added attribute. When rules are inferred from just a few conditional attributes, the two types of algorithms, minimal cover and all rules on examples, do not differ much, with similar numbers of constituent rules and close performance levels. Once more features are involved, the differences become more distinct.
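The scoring convention for ambiguous cases can be sketched as follows. The sketch reuses the hypothetical Rule representation from above and only illustrates the convention of counting unmatched and contradicting cases as errors; it is not the original implementation.

```python
def classify(sample, rules):
    """Return the decision indicated by the matching rules, or None
    when the case is ambiguous: no rule matches the sample, or the
    matching rules point to contradicting decisions."""
    matched = [r for r in rules
               if all(sample.get(a) == v for a, v in r.conditions.items())]
    decisions = {r.decision for r in matched}
    if len(decisions) != 1:
        return None  # ambiguous case
    return decisions.pop()

def accuracy(test_set, rules):
    """Fraction of correctly classified samples, with ambiguous
    cases counted as incorrect."""
    correct = sum(classify(sample, rules) == label
                  for sample, label in test_set)
    return correct / len(test_set)
```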
The first local maximum is detected for the subset of just five attributes, for which the all rules on examples algorithm, limited by rejecting rules with support lower than 7, classifies 91.67% of samples correctly. The best performance of this type of algorithm for six variables is lower, 88.33%. Yet for the same subset the minimal cover algorithm has a predictive accuracy of 91.67%, which is kept at the same level also for seven features before it decreases to 83.33% for eight attributes. The performance of the best rule classifiers at each stage is shown in Fig. 5.2 for both minimal cover and all rules on examples decision algorithms, denoted as MCDA and FDA respectively.
In the forward selection approach, each iterative step of the procedure involves more and more variables, and at each step we can ask whether it is enough, that is, whether we already have a set of features that satisfies our requirements. The answer is not straightforward. Even when predictive accuracy is considered the most important factor on which the decision is based, it is not a simple task of reaching some maximum: upon finding one we cannot know whether it is local or global, and after decreased performance for some subsequent subset in the search path we can encounter another local maximum. We know the true maximum only when all possible subsets of attributes have been tested (all possible on the selected search path, which is not exhaustive), that is, including the entire set of available variables.
When we can afford the extended processing of search procedures executed without additional stopping criteria, the performance observed for subsets of variables of gradually increasing cardinalities can be used as a means of feature weighting and ranking, to be employed for another inducer as a kind of filter. Alternatively, we can conclude the variable selection procedure by choosing the subset of features for which the classification accuracy was the highest among all tested alternatives, as sketched below.
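A minimal sketch of such a forward search without a stopping criterion is given below. The helper names (forward_search, evaluate_subset, all_attributes) are assumptions; evaluate_subset stands in for the whole rule induction, pruning, and testing procedure and is assumed to return the best predictive accuracy obtained for a given attribute subset.

```python
def forward_search(all_attributes, evaluate_subset):
    """Greedy forward selection run to the full attribute set.

    Returns the per-step history of (subset, accuracy), which can
    serve for attribute ranking, together with the subset whose
    recorded accuracy was highest among all tested alternatives.
    """
    selected = []
    remaining = list(all_attributes)
    history = []  # (subset, accuracy) recorded at each step
    while remaining:
        # try adding each remaining attribute and keep the best extension
        scored = [(evaluate_subset(selected + [a]), a) for a in remaining]
        best_score, best_attr = max(scored)
        selected.append(best_attr)
        remaining.remove(best_attr)
        history.append((tuple(selected), best_score))
    best_subset, best_acc = max(history, key=lambda step: step[1])
    return history, best_subset, best_acc
```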