theory. Filters are general in nature and this generality should be understood here as
applicability to any domain and any inducer. This universality is, however, most often
achieved at the cost of somewhat lower classification accuracy than other approaches offer.
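As a rough illustration of a filter, the sketch below ranks features with a score that does not involve any classifier. The scoring function (absolute difference of class means divided by the summed class spreads), the function name filter_rank, and the toy data are assumptions made for this example only; they are not taken from the chapter.

```python
from statistics import mean, pstdev

def filter_rank(samples, labels):
    """Rank feature indices, best first, by a classifier-independent score.

    Hypothetical score: |mean_a - mean_b| / (std_a + std_b) for two classes.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles the binary case only")
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, y in zip(samples, labels) if y == classes[0]]
        b = [row[j] for row, y in zip(samples, labels) if y == classes[1]]
        spread = pstdev(a) + pstdev(b)
        # The small constant guards against division by zero.
        scores.append(abs(mean(a) - mean(b)) / (spread + 1e-12))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Toy data: four text samples, three numeric features, two authors.
samples = [
    [0.10, 0.50, 0.30],
    [0.12, 0.40, 0.31],
    [0.30, 0.45, 0.29],
    [0.33, 0.55, 0.30],
]
labels = ["A", "A", "B", "B"]
ranking = filter_rank(samples, labels)
print(ranking)
```

Because the score looks only at the data, the same ranking could be handed to any inducer, which is the generality described above.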
In wrappers, the selection of features is conditioned by the performance of the inducer
itself and its characteristics [ 16 ]. Typically, predictive accuracy is considered
the most important and deciding factor. Dependence on a particular classifier
means a loss of generality and introduces bias, but at the same time the close tailoring
of the set of inputs to local requirements usually results in improved performance.
A solution is called embedded when an algorithm for feature selection and elimination
is a part of the learning system, some inherent dedicated mechanism that is
actively used [ 8 ]. Examples from this category include the construction of
decision trees, artificial neural networks with pruning of input neurons, and the
activation of relative reducts in rough set processing.
The chapter presents examples of combined filter, wrapper, and embedded
approaches for rule and connectionist classifiers employed for the evaluation of
features in the stylometric (or computational stylistics) domain, for a case of binary
authorship attribution. The considered features reflect lexical and syntactic characteristics
expressing writing styles [ 2 ]. The stylistic features are studied and evaluated within
two contexts: firstly by their established rankings, and secondly by the observed
performance of classifiers employing sequential backward selection while following these
rankings.
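A minimal sketch of sequential backward selection that follows a given ranking is shown here. It assumes the ranking lists feature indices from most to least important, drops the lowest-ranked feature at each step, and records the classifier performance after every step. The leave-one-out 1-nearest-neighbour evaluation, the function names, and the toy data are stand-ins chosen for brevity; the chapter itself works with rule and connectionist classifiers.

```python
def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature columns (an illustrative stand-in)."""
    hits = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        nearest = min(
            (k for k in range(len(samples)) if k != i),
            key=lambda k: sum((x[j] - samples[k][j]) ** 2 for j in features),
        )
        hits += labels[nearest] == y
    return hits / len(samples)

def backward_by_ranking(ranking, evaluate):
    """Remove features one at a time, lowest-ranked first.

    ranking:  feature indices, most important first (assumed given).
    evaluate: callable taking a list of features, returning a score.
    Returns a list of (remaining_features, score) pairs.
    """
    remaining = list(ranking)
    trace = [(tuple(remaining), evaluate(remaining))]
    while len(remaining) > 1:
        remaining.pop()  # discard the currently lowest-ranked feature
        trace.append((tuple(remaining), evaluate(remaining)))
    return trace

# Toy data: column 0 separates the authors, column 1 is misleading noise.
samples = [
    [0.10, 0.90, 0.30],
    [0.12, 0.10, 0.31],
    [0.30, 0.85, 0.29],
    [0.33, 0.15, 0.30],
]
labels = ["A", "A", "B", "B"]
ranking = [0, 2, 1]  # hypothetical ranking, most important first

trace = backward_by_ranking(
    ranking, lambda feats: loo_accuracy(samples, labels, feats)
)
for feats, score in trace:
    print(feats, score)
```

Reading the trace from top to bottom shows how performance changes as lower-ranked features are discarded, which is one way of judging both the ranking and the features themselves.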
The text of the chapter is organised as follows. Section 3.2 presents fundamental
notions of stylometric processing of texts and the features used in such analysis.
Section 3.3 is dedicated to the differences in approaches to the variable selection process,
while Sect. 3.4 provides some details of the experimental setup, and Sects. 3.5 and 3.6
contain illustrations of test results. Section 3.7 concludes the chapter.
3.2 Characteristic Features for Stylometric
Analysis of Texts
Stylometry is a branch of science dedicated to the understanding of writing styles, their
characteristics and descriptive elements, shared and unique traits, aiming at knowledge
discovery from a linguistic point of view, but also at author characterisation,
comparison, and recognition [ 7 , 31 ]. Stylometric processing typically involves either
statistics-oriented computer-aided computations [ 20 ], or methodologies from the machine
learning domain [ 34 ]. Once we obtain a definition of a writing style by some
characteristic features, the task of recognising it can be perceived as pattern recognition,
with text samples categorised and classified by their authors.
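To make the idea of describing a style by characteristic features concrete, the sketch below turns a text sample into a small numeric vector. The choice of markers (relative frequencies of a few common words as lexical descriptors and of punctuation marks as a crude syntactic proxy), the word and mark lists, and the function name style_features are assumptions for illustration, not the feature set used in the chapter.

```python
import re

def style_features(text, words=("the", "and", "but"), marks=(",", ";")):
    """Return relative frequencies of selected words and punctuation marks.

    Both counts are normalised by the number of word tokens in the text.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero for empty input
    features = {f"word:{w}": tokens.count(w) / total for w in words}
    features.update({f"mark:{m}": text.count(m) / total for m in marks})
    return features

sample = "The sea was calm, but the wind rose; and the night fell."
vector = style_features(sample)
for name, value in vector.items():
    print(f"{name:10s} {value:.3f}")
```

Vectors of this kind, computed for many samples by each author, form the input that a classifier needs in order to treat authorship attribution as pattern recognition.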
A style is a phenomenon which we grasp and recognise rather intuitively, but we
usually have trouble with more formal definitions and descriptions [ 3 ]. While we
can typically tell that we prefer someone's style over others, expressing the reasons
for our preferences in more detail, rather than in general qualificatory terms such as
“good”, “bad”, “enjoyable”, “boring”, etc., is much harder.
 