Consequently, numerous methods have been proposed for finding a (suboptimal)
solution by testing a fraction of the possible combinations.
Feature selection methods can be divided into three main types [4]:
1. Wrapper methods: The feature selection is performed around (and with)
a given classification algorithm. The classification algorithm is used for
ranking possible feature combinations.
2. Embedded methods: The feature selection is embedded within the classifi-
cation algorithm.
3. Filter methods: Features are selected for classification independently of the
classification algorithm.
The simplest wrapper algorithms are exhaustive search (which, as noted above,
is practical only when the number of features is small), sequential forward feature
selection (SFFS), and sequential backward feature selection (SBFS). More advanced
wrapper algorithms include simulated annealing and genetic algorithms [5]. Se-
quential forward feature selection works by first selecting the feature with which
the lowest classification error is reached and then greedily adding features that
provide the largest reduction in error. This process is continued until the classifi-
cation error starts to increase. The underlying assumption of this method is that
features can be evaluated individually. While this assumption often works well in
practice, it may cause this method to miss combinations of features that are useful
only when selected together.
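As a concrete illustration, the following is a minimal sketch of sequential forward feature selection, assuming a scikit-learn-style classifier and cross-validated accuracy as the scoring function; the logistic-regression scorer and the function name sequential_forward_selection are illustrative choices, not part of any specific published method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, estimator=None, cv=5):
    """Greedily add the feature that most reduces cross-validated
    classification error; stop when no candidate lowers the error."""
    if estimator is None:
        estimator = LogisticRegression(max_iter=1000)  # assumed scorer
    n_features = X.shape[1]
    selected = []          # indices of the features chosen so far
    best_error = np.inf
    while len(selected) < n_features:
        errors = {}
        for j in range(n_features):
            if j in selected:
                continue
            cols = selected + [j]
            acc = cross_val_score(estimator, X[:, cols], y, cv=cv).mean()
            errors[j] = 1.0 - acc          # error of this candidate subset
        best_j = min(errors, key=errors.get)
        if errors[best_j] >= best_error:
            break                          # error stopped decreasing
        selected.append(best_j)
        best_error = errors[best_j]
    return selected, best_error
```

Because each step keeps only the single best feature, this sketch inherits the limitation noted above: a pair of features that is useful only jointly may never be selected.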
Because wrapper algorithms use a classification algorithm as the scoring func-
tion, it is easy to parallelize them if the classification algorithm can be parallelized.
Alternatively, methods such as SFFS and SBFS can be parallelized by exploiting
the fact that, at each step, candidate features are evaluated independently of one
another, so different features can be assigned to different processors.
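For example, under the assumption that candidate subsets can be scored independently, a single forward-selection step can be distributed across worker processes; ProcessPoolExecutor and the helper names below are illustrative choices.

```python
from concurrent.futures import ProcessPoolExecutor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def _candidate_error(args):
    """Score one candidate feature added to the current subset."""
    X, y, selected, j, cv = args
    estimator = LogisticRegression(max_iter=1000)   # assumed scorer
    acc = cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
    return j, 1.0 - acc

def parallel_forward_step(X, y, selected, cv=5, workers=4):
    """Evaluate every remaining candidate feature in parallel and return
    the (feature index, error) pair with the lowest error."""
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    tasks = [(X, y, selected, j, cv) for j in candidates]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(_candidate_error, tasks))
    return min(results, key=lambda r: r[1])
```

On platforms that spawn rather than fork worker processes, the call to parallel_forward_step should be placed under an if __name__ == "__main__": guard.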
One example of the filtering approach to feature selection is that developed
by Koller and Sahami [6]. Here, information theoretic considerations are used
to decide on the most informative features, without the need for a classification
algorithm. This method estimates the cross-entropy between every pair of features
and discards those features that have a large cross-entropy with other features, thus
removing features that add little additional classification information. However,
because the cross-entropy must be estimated for every pair of features, this
algorithm scales quadratically with the number of features. Thus,
only data with a small to medium number of features can be evaluated using this
method. We note, however, that this method is easy to parallelize because of the
independence between the estimation procedure for each pair of features.
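As a rough illustration of this quadratic cost, the sketch below estimates a cross-entropy between the (discretized) marginal distributions of every pair of features; the equal-width binning and Laplace smoothing are assumptions made here for simplicity, and the full procedure of Koller and Sahami [6] is more involved than this marginal-distribution sketch.

```python
import numpy as np

def pairwise_cross_entropy(X, n_bins=10):
    """Return an n_features x n_features matrix H where
    H[i, j] = -sum_k p_i(k) * log p_j(k), estimated from histograms of
    each (discretized) feature. The double loop is what makes this style
    of filter scale quadratically with the number of features."""
    n_samples, n_features = X.shape
    probs = np.empty((n_features, n_bins))
    for i in range(n_features):
        hist, _ = np.histogram(X[:, i], bins=n_bins)
        probs[i] = (hist + 1.0) / (n_samples + n_bins)  # Laplace smoothing
    H = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            H[i, j] = -np.sum(probs[i] * np.log(probs[j]))
    return H
```

Because each entry of the matrix is estimated independently of the others, the double loop can be split across processors, which is the parallelization opportunity mentioned above.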
An altogether different approach to reducing dimensions is through feature
reduction, that is, finding a low-dimensional combination of features. This com-
bination can be linear or nonlinear.
Arguably the best known of these methods is principal component analysis
[PCA, also called the Karhunen-Loève transform (KLT)]. PCA projects the data
onto the directions of maximal variance. This is done by finding the eigenvectors
corresponding to the largest eigenvalues of the covariance matrix of the data and
projecting the data onto these eigenvectors. Thus, the low-dimensional features
are linear combinations of the original features.
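A minimal sketch of this procedure, using an eigendecomposition of the sample covariance matrix in NumPy (production implementations typically rely on the singular value decomposition for numerical stability):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the data onto the eigenvectors of its covariance matrix
    that correspond to the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues, largest first
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                   # low-dimensional features
```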