Feature Evaluation by Filter, Wrapper, and Embedded Approaches - Feature Selection for Data and Pattern Recognition

theory. Filters are general in nature, and this generality should be understood here as applicability to any domain and any inducer. This universality is, however, most often achieved at the cost of somewhat lower classification accuracy than other approaches attain.

In wrappers, the selection of features is conditioned by the performance of the inducer itself and by its characteristics [16]. Typically, predictive accuracy is considered the most important and deciding factor. Dependence on a particular classifier means a loss of generality and introduces bias, but at the same time close tailoring of the set of inputs to local requirements usually results in improved performance.

A solution is called embedded when an algorithm for feature selection and elimination is a part of the learning system, some inherent, dedicated mechanism that is actively used [8]. Examples from this category include the construction of decision trees, artificial neural networks with pruning of input neurons, and activation of relative reducts in rough set processing.
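The contrast between filters and wrappers can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the scoring formula in `filter_score`, the toy nearest-centroid inducer, the leave-one-out estimate in `wrapper_score`, and the six-sample data set. The filter scores a single feature from the data alone, while the wrapper scores a feature subset by the accuracy of one particular inducer trained on it.

```python
from statistics import mean, pstdev


def filter_score(column, labels):
    """Filter: score one feature from the data alone, with no classifier.

    Assumed measure: absolute difference of the two class means divided
    by the sum of the class standard deviations.
    """
    a = [x for x, y in zip(column, labels) if y == 0]
    b = [x for x, y in zip(column, labels) if y == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)


def centroid_predict(train_rows, train_labels, row):
    """Toy inducer: pick the class whose centroid is nearest to the row."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroid = [mean(col) for col in zip(*members)]
        dist = sum((v - c) ** 2 for v, c in zip(row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def wrapper_score(rows, labels, subset):
    """Wrapper: score a feature subset by leave-one-out accuracy of the inducer."""
    projected = [[r[i] for i in subset] for r in rows]
    hits = 0
    for k in range(len(projected)):
        train = projected[:k] + projected[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        hits += centroid_predict(train, train_y, projected[k]) == labels[k]
    return hits / len(rows)


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 1.0], [0.8, 3.0], [3.0, 4.0], [3.2, 2.0], [2.8, 6.0]]
labels = [0, 0, 0, 1, 1, 1]

for i in range(2):
    column = [r[i] for r in rows]
    print(i, filter_score(column, labels), wrapper_score(rows, labels, [i]))
```

An embedded approach would instead read feature relevance off the trained model itself, for example from the attributes that a constructed decision tree actually tests.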

The chapter presents examples of combined filter, wrapper, and embedded approaches for rule and connectionist classifiers employed for the evaluation of features in the stylometric (computational stylistics) domain, for the case of binary authorship attribution. The considered features reflect lexical and syntactic characteristics expressing writing styles [2]. The stylistic features are studied and evaluated within two contexts: first by their established rankings, and second by the observed performance of classifiers employing sequential backward selection while following these rankings.
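Sequential backward selection guided by a ranking can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's experimental procedure: it assumes the ranking is ordered from most to least important and that the lowest-ranked feature is discarded at each step. The function name `backward_elimination_by_ranking`, the feature names, and the stand-in `toy_evaluate` are all hypothetical.

```python
def backward_elimination_by_ranking(ranking, evaluate):
    """Remove features one at a time, lowest-ranked first.

    ranking:  features ordered from most to least important.
    evaluate: callable taking a list of features and returning a score.
    Returns a list of (remaining_features, score), starting with the full set.
    """
    remaining = list(ranking)
    trace = []
    while remaining:
        trace.append((tuple(remaining), evaluate(remaining)))
        remaining.pop()  # discard the currently lowest-ranked feature
    return trace


# Hypothetical ranking of four stylistic features, best first.
ranking = ["and", "but", ",", ";"]


def toy_evaluate(features):
    """Stand-in for training and testing a classifier on these features.

    A real wrapper would return the measured accuracy of the classifier.
    """
    useful = {"and", "but", ","}  # assumed, for illustration only
    return len(useful.intersection(features)) / len(useful)


for features, score in backward_elimination_by_ranking(ranking, toy_evaluate):
    print(len(features), round(score, 2))
```

Plotting the recorded scores against the number of remaining features shows how far a given ranking allows the feature set to be reduced before performance degrades.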

The text of the chapter is organised as follows. Section 3.2 presents fundamental notions of stylometric processing of texts and the features used in such analysis. Section 3.3 is dedicated to the differences between approaches to the variable selection process, while Sect. 3.4 provides some details of the experimental setup, and Sects. 3.5 and 3.6 illustrate test results. Section 3.7 concludes the chapter.

3.2 Characteristic Features for Stylometric Analysis of Texts

Stylometry is a branch of science dedicated to the understanding of writing styles, their characteristics and descriptive elements, and their shared and unique traits, aiming at knowledge discovery from a linguistic point of view, but also at author characterisation, comparison, and recognition [7, 31]. Stylometric processing typically involves either statistics-oriented computer-aided computations [20] or methodologies from the machine learning domain [34]. Once we obtain a definition of a writing style by some characteristic features, the task of recognising it can be perceived as pattern recognition, with text samples categorised and classified by their authors.
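As a rough illustration of how a text sample might be turned into a feature vector for such pattern recognition, the sketch below computes relative frequencies of a few function words and punctuation marks. The choice of these particular descriptors, the normalisation, and the name `style_features` are assumptions made for this example; the text above states only that a style is defined by some characteristic features.

```python
import re


def style_features(text, function_words, punctuation):
    """Map a text sample to relative frequencies of the chosen markers.

    Word frequencies are relative to the number of words in the sample,
    punctuation frequencies to the number of characters.
    """
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words) or 1   # avoid division by zero on empty input
    n_chars = len(text) or 1
    features = {}
    for w in function_words:
        features["word:" + w] = words.count(w) / n_words
    for p in punctuation:
        features["punct:" + p] = text.count(p) / n_chars
    return features


# Made-up sample sentence and marker lists, for illustration only.
sample = "The cat and the dog, and the bird."
print(style_features(sample, ["the", "and", "but"], [",", ";"]))
```

Vectors of this kind, computed for samples by each of two authors and labelled accordingly, would form the input to a binary classifier.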

A style is a phenomenon which we grasp and recognise rather intuitively, but for which we usually have trouble giving more formal definitions and descriptions [3]. While we can typically tell that we prefer someone's style over others, expressing the reasons for our preferences in more detail, rather than in general qualificatory terms such as “good”, “bad”, “enjoyable”, or “boring”, is much harder.
