To employ contemporary data mining techniques for stylometric analysis [1], quantitative rather than qualitative descriptors are required, based on statistics of linguistic features. Since their selection reflects the richness of language, the list of existing possibilities is practically endless. The markers often exploit term frequencies of occurrence [24] and are divided into four categories: lexical, syntactic, structural, and content-specific [30].
Lexical descriptors provide information such as the total numbers of characters or words, the average numbers of characters per word or sentence, the average number of words per sentence, and the distributions of these quantities. Syntactic markers express the structure of sentences as created by punctuation marks [4]. Structural attributes reflect the overall organisation of a text into paragraphs, sections, headings, signatures, and embedded formatting elements. Content-specific features refer to words and phrases of key meaning in some context [6]. Of these four groups, lexical and syntactic descriptors are the ones typically chosen in authorship attribution tasks [41]; a small sketch of such descriptors follows.
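As a minimal illustrative sketch only (the function name and the particular set of markers are assumptions, not taken from the source), lexical and syntactic descriptors of the kind listed above can be computed from raw text with nothing more than standard Python:

```python
import re
from collections import Counter

def lexical_and_syntactic_features(text: str) -> dict:
    """Hypothetical helper: a handful of lexical and syntactic descriptors."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = Counter(ch for ch in text if ch in ",.;:!?")
    total_chars = sum(len(w) for w in words)
    features = {
        # Lexical descriptors: totals and averages
        "total_words": len(words),
        "total_chars": total_chars,
        "avg_chars_per_word": total_chars / len(words) if words else 0.0,
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }
    # Syntactic markers: punctuation-mark frequencies normalised by word count
    for mark in ",.;:!?":
        features["freq_" + mark] = punctuation[mark] / len(words) if words else 0.0
    return features

sample = "Call me Ishmael. Some years ago, never mind how long precisely, I went to sea."
print(lexical_and_syntactic_features(sample))
```

In practice the descriptor list is far richer, as noted above, but the pattern is the same: each text sample is mapped to a fixed-length vector of frequency-based statistics.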
Even though it is universally acknowledged that reliable authorship attribution can be performed using stylometric descriptors as characteristic features, there is no consensus on how these sets of variables should be constructed. As always, a sufficiently high number of representative text samples is needed, but based on them various candidate subsets of attributes can be prepared, and knowledge about their efficiency and relevance for the purposes of classification is unavailable a priori.
In the absence of domain knowledge about the importance of attributes, a different approach can be tried: applying some methodology that can by itself discover the relevance of variables, or approaches dedicated to feature selection and reduction, either singly or in combination. Even when expert knowledge is available, feature selection algorithms can help with dimensionality reduction and the improvement of the obtained results. When the task of feature set construction is considered in the context of data processing and mining, it can be biased by the particular technique used, with the result that alternative feature sets, found by other approaches or computations, may exist. Thus the widely accepted procedure is to propose some candidate set of attributes (chosen by arbitrary assumptions, statistics, or heuristics), test its quality, and optimise it for the set criteria, as in the sketch below.
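The following sketch shows one plausible way to score such a candidate attribute subset: cross-validated classification accuracy over the columns of a feature matrix. The data, the chosen classifier (Gaussian naive Bayes via scikit-learn), and the particular subset are synthetic placeholders introduced for illustration, not elements of the source procedure.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))        # 60 text samples, 20 candidate descriptors (synthetic)
y = rng.integers(0, 2, size=60)      # labels for two hypothetical authors

candidate = [0, 3, 5, 7, 11]         # an arbitrarily proposed attribute subset
scores = cross_val_score(GaussianNB(), X[:, candidate], y, cv=5)
print("mean cross-validated accuracy for this subset:", round(scores.mean(), 3))
```

Different candidate subsets can then be compared by repeating the evaluation and keeping the one that best satisfies the chosen criteria.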
3.3 Approaches to Feature Selection
In many classification tasks the total number of possible features that can be employed is relatively high. Using all of them would result in correspondingly high dimensionality, which encumbers processing and may even make it impractical; moreover, the presence of too many variables is a drawback for most inducers even when these attributes are by themselves relevant to the task, not to mention irrelevant or redundant variables, which can obscure other patterns [18]. In such cases several candidate subsets can be tried and their efficiency tested, or we can employ algorithms explicitly dedicated to feature selection and reduction, as sketched below.
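As a hedged example of the latter option, the sketch below applies one common filter-style selector, ranking attributes by an information-theoretic score and keeping the top k. It uses scikit-learn's SelectKBest with mutual information; the data are synthetic placeholders and the choice of k is arbitrary, standing in for whatever dedicated selection algorithm is actually used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 50))                # many candidate features (synthetic)
y = (X[:, 2] + X[:, 7] > 0).astype(int)      # only a few columns actually matter here

# Rank all attributes by mutual information with the class and keep the best 5
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
print("reduced feature matrix shape:", X_reduced.shape)
```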
 