To employ contemporary data mining techniques for stylometric analysis [1], quantitative rather than qualitative descriptors are required, based on statistics of linguistic features. Since their selection reflects the richness of language, the list of existing possibilities is practically endless. The markers often exploit term frequencies of occurrence [24] and are divided into four categories: lexical, syntactic, structural, and content-specific [30].
Lexical descriptors provide information such as the total numbers of characters or words, the average numbers of characters per word or sentence, the average number of words per sentence, and the distributions of these quantities. Syntactic markers express the structure of sentences as created by punctuation marks [4]. Structural attributes reflect the overall organisation of a text into paragraphs, sections, headings, signatures, and embedded formatting elements. Content-specific features refer to words and phrases of key meaning in some context [6]. Of these four groups, lexical and syntactic descriptors are the ones typically chosen in authorship attribution tasks [41]; a small sketch of such descriptors follows.
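As a minimal illustrative sketch only (the function name and the particular set of markers are assumptions, not taken from the source), lexical and syntactic descriptors of the kind listed above can be computed from raw text with nothing more than standard Python:

```python
import re
from collections import Counter

def lexical_and_syntactic_features(text: str) -> dict:
    """Hypothetical helper: a handful of lexical and syntactic descriptors."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = Counter(ch for ch in text if ch in ",.;:!?")
    total_chars = sum(len(w) for w in words)
    features = {
        # Lexical descriptors: totals and averages
        "total_words": len(words),
        "total_chars": total_chars,
        "avg_chars_per_word": total_chars / len(words) if words else 0.0,
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }
    # Syntactic markers: punctuation-mark frequencies normalised by word count
    for mark in ",.;:!?":
        features["freq_" + mark] = punctuation[mark] / len(words) if words else 0.0
    return features

sample = "Call me Ishmael. Some years ago, never mind how long precisely, I went to sea."
print(lexical_and_syntactic_features(sample))
```

In practice the descriptor list is far richer, as noted above, but the pattern is the same: each text sample is mapped to a fixed-length vector of frequency-based statistics.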
Even though it is universally acknowledged that reliable authorship attribution can be performed using stylometric descriptors as characteristic features, there is no consensus on how these sets of variables should be constructed. As always, a sufficiently high number of representative text samples is needed, but based on them various candidate subsets of attributes can be prepared, and knowledge about their efficiency and relevance for the purposes of classification is unavailable a priori.
In the absence of domain knowledge about the importance of attributes, a different approach can be tried: applying some methodology that can by itself discover the relevance of variables, or approaches dedicated to feature selection and reduction, either singly or in combination. Even when expert knowledge is available, feature selection algorithms can help with dimensionality reduction and the improvement of the obtained results. When the task of feature set construction is considered in the context of data processing and mining, it can be biased by the particular technique used, with the result that alternative feature sets, found by other approaches or computations, may exist. Thus the widely accepted procedure is to propose some candidate set of attributes (chosen by arbitrary assumptions, statistics, or heuristics), test its quality, and optimise it for the set criteria, as in the sketch below.
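The following sketch shows one plausible way to score such a candidate attribute subset: cross-validated classification accuracy over the columns of a feature matrix. The data, the chosen classifier (Gaussian naive Bayes via scikit-learn), and the particular subset are synthetic placeholders introduced for illustration, not elements of the source procedure.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))        # 60 text samples, 20 candidate descriptors (synthetic)
y = rng.integers(0, 2, size=60)      # labels for two hypothetical authors

candidate = [0, 3, 5, 7, 11]         # an arbitrarily proposed attribute subset
scores = cross_val_score(GaussianNB(), X[:, candidate], y, cv=5)
print("mean cross-validated accuracy for this subset:", round(scores.mean(), 3))
```

Different candidate subsets can then be compared by repeating the evaluation and keeping the one that best satisfies the chosen criteria.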
3.3 Approaches to Feature Selection
In many classification tasks the total number of possible features that can be employed is relatively high. Using all of them would result in correspondingly high dimensionality, which encumbers processing and may even make it impractical; moreover, the presence of too many variables is a drawback for most inducers even when these attributes are by themselves relevant to the task, not to mention irrelevant or redundant variables, which can obscure other patterns [18]. In such cases several candidate subsets can be tried and their efficiency tested, or we can employ algorithms explicitly dedicated to feature selection and reduction, as sketched below.
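As a hedged example of the latter option, the sketch below applies one common filter-style selector, ranking attributes by an information-theoretic score and keeping the top k. It uses scikit-learn's SelectKBest with mutual information; the data are synthetic placeholders and the choice of k is arbitrary, standing in for whatever dedicated selection algorithm is actually used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 50))                # many candidate features (synthetic)
y = (X[:, 2] + X[:, 7] > 0).astype(int)      # only a few columns actually matter here

# Rank all attributes by mutual information with the class and keep the best 5
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
print("reduced feature matrix shape:", X_reduced.shape)
```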
 