5.2.4 Textual Analysis
Some input data sets directly determine the data mining techniques that can be applied
to them, while for others many alternative approaches can be used. Stylometry, which
was the application domain in the research on weighting characteristic features in
forward and backward selection illustrated in this chapter, deals with stylistic textual
descriptors that reflect the individual linguistic preferences of writers [ 8 ]. Processing
relies either on computer-aided, statistics-oriented computations [ 22 ] or on
methodologies from the machine learning area [ 1 , 42 ].
While text categorisation with respect to subject content uses key words
and phrases of specific significance [ 6 ], categorisation by author, which is
considered the most important of the stylometric tasks, needs to detect more subtle
linguistic elements, because the aim is to recognise who has written a text regardless
of what it is about [ 7 ].
Stylometric processing typically exploits textual descriptors that writers
employ rather subconsciously, based on common parts of speech. Under more
detailed analysis they reveal patterns corresponding to individual habits and
preferences, invisible to the naked eye, which makes them hard to imitate [ 2 ].
Even though linguists agree that writers have individual styles, they cannot
really help when asked for style definitions. Since styles are unique, they cannot be
expressed by any general rule that would be universal and applicable to all writers
and all texts [ 3 ]. Instead, for each author a set of discriminating features needs to be
established by tests.
The markers most commonly used in authorship attribution come from either the
lexical or the syntactic group. Lexical descriptors give numerical statistics such as
frequencies of occurrence, distributions of frequencies, and averages for characters,
words, and phrases [ 28 ]. Syntactic markers express the organisation of a text into units
such as sentences and paragraphs, as marked by punctuation [ 4 ].
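The two groups of markers can be illustrated with a minimal Python sketch. The word list, the punctuation list, the tokenisation rule, and the function names below are assumptions made for illustration only; the source does not specify which markers were actually used.

```python
import re
from collections import Counter

# Hypothetical marker lists chosen for illustration; not taken from the source.
FUNCTION_WORDS = ["the", "and", "of", "to", "but", "not"]
PUNCTUATION_MARKS = [",", ".", ";", ":", "!", "?"]


def tokenize(text):
    """Lower-case the text and return its word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_features(text):
    """Relative frequency of each selected word among all word tokens."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # guard against division by zero on empty input
    return {w: counts[w] / total for w in FUNCTION_WORDS}


def syntactic_features(text):
    """Occurrences of each punctuation mark per word token."""
    total = len(tokenize(text)) or 1
    return {p: text.count(p) / total for p in PUNCTUATION_MARKS}


sample = "The sea was calm, and the wind was not; but the ship did not move."
lex = lexical_features(sample)
syn = syntactic_features(sample)
```

Normalising by the number of word tokens keeps the values comparable across samples whose lengths differ slightly.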
5.3 Experimental Setting
To be reliable, all numerical characteristics need to be calculated over a sufficient
number of representative writing samples. In general, the bigger the corpus, the higher
the chance of a good recognition ratio. That is why the experiments used novels
written by two writers, Henry James and Thomas Hardy, divided into smaller parts
of comparable length. All texts used in the experiments are available in
electronic formats for download and on-line reading thanks to Project Gutenberg
(http://www.gutenberg.org).
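Dividing a novel into parts of comparable length could be sketched as follows. The source does not state the sample size or the splitting rule, so the fixed word count per sample, the discarding of the trailing remainder, and the name split_into_samples are all assumptions.

```python
def split_into_samples(text, words_per_sample):
    """Split a text into consecutive samples with the same number of words.

    A trailing remainder shorter than words_per_sample is discarded so that
    all samples stay comparable in length (an assumption, not the source's rule).
    """
    words = text.split()
    n_full = len(words) // words_per_sample
    return [
        " ".join(words[i * words_per_sample:(i + 1) * words_per_sample])
        for i in range(n_full)
    ]


# Placeholder text standing in for a novel downloaded as plain text.
novel = "word " * 2500
samples = split_into_samples(novel, 1000)
```

Splitting on chapter or paragraph boundaries instead would preserve more context at the cost of less uniform sample lengths.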
To avoid the problems that can result from imbalanced data sets in classification,
in both groups of samples exactly one half corresponds to one author and the
other half to the second one, so the classification task is binary and balanced.
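A balanced two-class set of this kind might be assembled as in the sketch below. The random selection, the 0/1 label encoding, and the name balanced_binary_set are illustrative assumptions; the source only states that each group of samples is split evenly between the two authors.

```python
import random


def balanced_binary_set(samples_a, samples_b, per_class, seed=0):
    """Return a shuffled list of (sample, label) pairs with per_class items per author.

    Label 0 marks the first author and label 1 the second (hypothetical encoding).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    chosen_a = rng.sample(samples_a, per_class)
    chosen_b = rng.sample(samples_b, per_class)
    data = [(s, 0) for s in chosen_a] + [(s, 1) for s in chosen_b]
    rng.shuffle(data)
    return data


# Placeholder sample identifiers standing in for text parts of each author.
first_author = [f"a{i}" for i in range(10)]
second_author = [f"b{i}" for i in range(14)]
dataset = balanced_binary_set(first_author, second_author, per_class=8)
labels = [label for _, label in dataset]
```

Drawing the same number of samples per author keeps the class proportions at exactly one half each, whatever the sizes of the original pools.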
 