Feature Selection - Data Preprocessing in Data Mining

two subspace-based separability measures to determine the individual discriminatory
power of the features, namely the common subspace measure and the Fisher subspace
measure, which can easily be used to assess discrimination capability for FS. After
demonstrating that the existence of sufficiently correlated features can always
prevent selecting the optimal feature set, the authors of [ 64 ] proposed the
redundancy-constrained FS (RCFS) method. Recent studies include FS via dependence
maximization [ 51 ], which uses the Hilbert-Schmidt independence criterion.
Furthermore, similarity-preserving FS was presented in [ 62 ].
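
As a rough illustration of dependence-based feature scoring, the sketch below computes
the standard biased empirical Hilbert-Schmidt independence criterion, tr(KHLH)/(n-1)^2,
between a single feature and the class labels. It is not the procedure of [ 51 ]: the
Gaussian kernel on the feature, the delta kernel on the labels, the bandwidth sigma and
the toy data are all assumptions chosen for illustration.

```python
from math import exp

def gaussian_kernel(xs, sigma=1.0):
    """Gaussian (RBF) kernel matrix for a list of scalar feature values."""
    return [[exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]

def delta_kernel(ys):
    """Label kernel: 1 when two samples share a class, 0 otherwise."""
    return [[1.0 if a == b else 0.0 for b in ys] for a in ys]

def center(K):
    """Double-center a symmetric kernel matrix, i.e. compute HKH."""
    n = len(K)
    row = [sum(r) / n for r in K]   # equals the column means, since K is symmetric
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between one feature and the labels."""
    n = len(xs)
    Kc = center(gaussian_kernel(xs, sigma))
    L = delta_kernel(ys)
    # tr(KHLH) = tr(HKH L) = sum_ij (HKH)_ij * L_ij because L is symmetric.
    trace = sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

# Hypothetical toy data: one feature that separates the classes, one that does not.
labels = [0, 0, 0, 1, 1, 1]
informative = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
noise = [5.0, 1.0, 9.0, 5.1, 1.1, 9.1]

scores = {"informative": hsic(informative, labels), "noise": hsic(noise, labels)}
```

On this toy data the class-separating feature receives the higher score, so ranking
features by this value would keep it first.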

The use of meta-heuristics is widespread in FS. In [ 44 ], a genetic algorithm is
employed to optimize a vector of feature weights with the KNN classifier, allowing
both FS and feature extraction. A tabu search algorithm is introduced in [ 61 ], using
a 0/1 bit string to represent solutions and an evaluation measure based on error
rates. More advanced hybridizations of genetic algorithms with local search operations
have also been applied to FS [ 40 ]. Similarly, the approach defined in [ 65 ] combines
a wrapper-based genetic algorithm with a filter-based local search. An iterative
version of Relief, called I-RELIEF, is proposed in [ 52 ] by exploring the framework
of the EM algorithm.
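
The 0/1 bit-string encoding with an error-rate evaluation can be sketched as follows.
This is a minimal illustration, not the algorithm of [ 61 ] or [ 44 ]: the
leave-one-out 1-NN error, the single-bit-flip neighborhood, the helper names and the
toy data are assumptions.

```python
from math import dist

def loo_1nn_error(X, y, mask):
    """Leave-one-out 1-NN error rate using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 1.0  # assumption: an empty subset is scored as the worst case
    proj = [[row[j] for j in cols] for row in X]
    errors = 0
    for i, p in enumerate(proj):
        nearest = min((k for k in range(len(proj)) if k != i),
                      key=lambda k: dist(p, proj[k]))
        errors += y[nearest] != y[i]
    return errors / len(X)

def best_flip(X, y, mask, tabu=()):
    """Evaluate every single-bit flip whose position is not tabu; return the best."""
    candidates = []
    for j in range(len(mask)):
        if j in tabu:
            continue
        neighbor = mask[:j] + [1 - mask[j]] + mask[j + 1:]
        candidates.append((loo_1nn_error(X, y, neighbor), neighbor))
    return min(candidates, key=lambda c: c[0])

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]

error_all = loo_1nn_error(X, y, [1, 1])     # both features selected
error, mask = best_flip(X, y, [1, 1])       # one local-search move
```

A tabu search would repeat `best_flip`, adding recently flipped positions to `tabu`,
while a genetic algorithm would apply crossover and mutation to the same bit strings
and use the same error rate as fitness.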

One of the most successful paradigms used in FS is rough set theory. Since rough sets
were first applied to pattern recognition [ 54 ], many FS methods have based their
evaluation criteria on reducts and approximations according to this theory. Because
complete searches are not feasible for large data sets, stochastic approaches based on
meta-heuristics combined with rough set evaluation criteria have also been analyzed.
In particular, particle swarm optimization has been used for this task [ 58 ].
However, the main limitation of rough set-based attribute selection in the literature
is the restrictive requirement that all data be discrete. To solve this problem, the
authors of [ 20 ] proposed an approach based on fuzzy-rough sets, fuzzy-rough FS
(FRFS). Later, in [ 9 ], a generalization of FS based on rough sets is shown using
fuzzy tolerance relations. Another way of evaluating numerical features is to
generalize the model with neighborhood relations and introduce a neighborhood rough
set model [ 17 ]. The neighborhood model is used to reduce numerical and categorical
features by assigning different thresholds for different kinds of attributes.
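
The idea of per-attribute thresholds can be sketched with a simplified dependency
measure: a sample counts toward the positive region when every sample in its
neighborhood has the same class. This is an illustrative simplification, not the exact
model of [ 17 ]. The per-attribute comparison, the convention that a threshold of
`None` marks a categorical attribute, and the toy data are all assumptions.

```python
def neighbors(data, i, features, thresholds):
    """Indices of samples that are close to sample i on every selected feature."""
    def close(a, b, f):
        t = thresholds[f]
        if t is None:               # categorical attribute: require equality
            return a == b
        return abs(a - b) <= t      # numerical attribute: distance threshold
    return [k for k, row in enumerate(data)
            if all(close(data[i][f], row[f], f) for f in features)]

def dependency(data, labels, features, thresholds):
    """Fraction of samples whose neighborhood contains a single class."""
    consistent = sum(
        1 for i in range(len(data))
        if len({labels[k] for k in neighbors(data, i, features, thresholds)}) == 1
    )
    return consistent / len(data)

# Hypothetical toy data: attribute 0 is numerical, attribute 1 is categorical.
data = [(0.10, "a"), (0.15, "a"), (0.20, "b"),
        (0.80, "b"), (0.85, "a"), (0.90, "b")]
labels = [0, 0, 0, 1, 1, 1]
thresholds = {0: 0.15, 1: None}

gamma_numeric = dependency(data, labels, [0], thresholds)
gamma_categorical = dependency(data, labels, [1], thresholds)
```

A subset search, greedy or meta-heuristic, could then keep the smallest feature subset
whose dependency matches that of the full attribute set.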

The fusion of filters and wrappers in FS has also been studied in the literature.
In [ 56 ], the evaluation criterion merges dependency, correlation coefficients and
error estimation by KNN. As mentioned above, the memetic FS algorithm proposed in
[ 65 ] also combines wrapper and filter evaluation criteria. The method GAMIFS [ 14 ]
can be viewed as a genetic algorithm that forms a hybrid filter/wrapper feature
selector. On the other hand, the fusion of predictive models in the form of ensembles
can generate a compact subset of non-redundant features [ 55 ] when data is wide,
dirty, mixed with both numerical and categorical predictors, and may contain
interactive effects that require complex models. The algorithm proposed in [ 55 ]
follows a process divided into four stages and considers a Random Forest ensemble:
(1) identification of important variables, (2) computation of masking scores,
(3) removal of masked variables and (4) generation of residuals for incremental
adjustment.
