measure, then treated as a weight or rank assigned through processing. Therefore,
in such a case we obtain a ranking of variables [27].
When a ranking of attributes is exploited in some processing, for feature selection
or reduction, and that processing is executed independently of the procedures that led
to the ranking in the first place, then by formal definition the ranking can be perceived
as a filter. Even a wrapper can be employed as a ranking filter in a subsequent stage of
calculations, as long as it follows a search path that gives an ordering of all variables,
and the inducer used in the second stage is different from the one used in the first.
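
To make the idea of a ranking used as a filter concrete, the following is a minimal sketch in Python. The scores are hypothetical placeholders standing in for weights produced by some earlier processing stage; the function names `rank_features` and `filter_top_k` are illustrative and do not come from the source.

```python
from typing import Dict, List, Sequence


def rank_features(scores: Dict[str, float]) -> List[str]:
    """Order feature names from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def filter_top_k(ranking: Sequence[str], k: int) -> List[str]:
    """Keep the k highest-ranked features, without consulting any classifier."""
    return list(ranking[:k])


# Hypothetical weights, as if assigned by a previous stage of processing.
example_scores = {"and": 0.42, "but": 0.17, "comma": 0.63, "of": 0.05}

ranking = rank_features(example_scores)
selected = filter_top_k(ranking, 2)
print(ranking)   # ['comma', 'and', 'but', 'of']
print(selected)  # ['comma', 'and']
```

The selection step only reads the ordering, so any inducer can be trained afterwards on the retained features, which is what makes the ranking behave as a filter here.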
3.4 Details of Research Framework
Before conducting any experiments, several decisions needed to be made with
respect to:
• input data sets, defined by the numbers of analysed learning and testing samples,
  and a set of available stylometric characteristic features,
• machine learning techniques used in classification,
• the point in the feature space where search procedures start and directions of the
  search,
• the stopping criterion for the search process,
• evaluation method for a candidate variable subset,
• organisation of the search,
as described below in more detail.
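
How these decisions fit together can be illustrated with a generic sequential search, sketched below in Python. This is not the procedure used in the research described here; it is a simplified illustration in which the starting point, the direction, the stopping criterion, and the evaluation method are explicit parameters. The function `sequential_search`, the parameter `min_gain`, and the scoring function `toy_score` are all hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple


def sequential_search(
    all_features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    forward: bool = True,
    min_gain: float = 0.0,
) -> Tuple[Set[str], float]:
    """Greedy sequential selection (forward) or elimination (backward)."""
    features = list(all_features)
    # Starting point: the empty set for forward search, the full set for backward.
    current: Set[str] = set() if forward else set(features)
    best_score = evaluate(frozenset(current))
    while True:
        # Candidate moves: add one absent feature, or drop one present feature.
        moves = [f for f in features if (f in current) != forward]
        if not moves:
            break
        scored = []
        for f in moves:
            subset = current | {f} if forward else current - {f}
            scored.append((evaluate(frozenset(subset)), f))
        top_score, top_feature = max(scored)
        # Stopping criterion: no single move improves the score by more than min_gain.
        if top_score - best_score <= min_gain:
            break
        current = current | {top_feature} if forward else current - {top_feature}
        best_score = top_score
    return current, best_score


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical evaluation: reward two 'useful' features, penalise subset size."""
    useful = {"and", "comma"}
    return 10 * len(subset & useful) - len(subset)


pool = ["and", "but", "comma", "of"]
print(sequential_search(pool, toy_score, forward=True))
print(sequential_search(pool, toy_score, forward=False))
```

In a wrapper setting, `evaluate` would typically train and test a classifier on the candidate subset; a simple arithmetic score is used here only so that the sketch runs without any data.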
3.4.1 Input Data Sets
To ensure the reliability of detected patterns in linguistic habits and preferences,
statistics must be based on several samples of writing, each of sufficient length.
In the considered case, novels by two famous writers, Thomas Hardy and Henry James,
were used. Since in documents this long it is natural to observe small variations of
style depending on the character of the text parts (narrative or dialogue), the novels
were divided into smaller samples, corresponding to chapters or sections, to keep
their lengths comparable.
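
As a rough illustration of such a division, the sketch below splits a text on chapter headings and reports the word count of each resulting sample. The heading pattern (lines starting with "CHAPTER") and the function name `split_into_samples` are assumptions made for this example; the source does not state how the chapter boundaries were actually detected.

```python
import re
from typing import List


def split_into_samples(text: str, heading_pattern: str = r"(?m)^CHAPTER\s+\S+.*$") -> List[str]:
    """Split text at heading lines and return the non-empty parts between them."""
    parts = re.split(heading_pattern, text)
    return [part.strip() for part in parts if part.strip()]


# A tiny made-up text standing in for a novel.
novel = (
    "CHAPTER I\n"
    "The first part of the story.\n"
    "CHAPTER II\n"
    "The second part, which is a little longer.\n"
)

samples = split_into_samples(novel)
word_counts = [len(sample.split()) for sample in samples]
print(samples)
print(word_counts)
```

Checking the word counts is one simple way to see whether the samples are of comparable length before any features are computed.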
Next, for all these prepared parts the characteristic features were extracted for 25
arbitrarily selected lexical and syntactic descriptors (which had proved useful in
some past research on authorship attribution [35, 36]), by calculating the frequencies
of usage of selected function words and punctuation marks, as follows:
• lexical markers (17): but, and, not, in, with, that, what, for, by, if, from, at, to, as,
  on, of, this,
• syntactic markers (8): a comma, a full stop, a colon, a semicolon, a bracket, a
  question mark, an exclamation mark, a hyphen.
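
The extraction of these 25 descriptors can be sketched as follows. Several details are assumptions made only for this example, since the source does not specify them: frequencies are normalised by the total number of words in the sample, words are found with a simple lowercase regular expression, and the bracket marker counts opening round brackets only. The name `extract_features` is likewise illustrative.

```python
import re
from typing import Dict

# The 17 lexical markers listed above.
LEXICAL = ["but", "and", "not", "in", "with", "that", "what", "for", "by",
           "if", "from", "at", "to", "as", "on", "of", "this"]

# The 8 syntactic markers; counting only "(" for brackets is an assumption.
SYNTACTIC = {"comma": ",", "full stop": ".", "colon": ":", "semicolon": ";",
             "bracket": "(", "question mark": "?", "exclamation mark": "!",
             "hyphen": "-"}


def extract_features(sample: str) -> Dict[str, float]:
    """Return 25 frequencies, each relative to the sample's word count (assumed)."""
    words = re.findall(r"[a-z']+", sample.lower())
    total = len(words) or 1  # guard against empty samples
    features = {word: words.count(word) / total for word in LEXICAL}
    for name, mark in SYNTACTIC.items():
        features[name] = sample.count(mark) / total
    return features


vector = extract_features("To be, or not to be: that is the question.")
print(len(vector))
print(vector["to"], vector["not"], vector["comma"], vector["colon"])
```

Each text sample then becomes one vector of 25 values, which is the form in which it can be passed to a classifier or a feature selection procedure.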
 