Chapter 13
Feature Selection
13.1 Overview
Dimensionality (i.e., the number of attributes, or groups of attributes, in
the dataset) constitutes a serious obstacle to the efficiency of most induction
algorithms, primarily because induction algorithms are computationally
intensive. Feature selection is an effective way to deal with dimensionality.
The objective of feature selection is to identify the features in
the dataset that are important and to discard the others as irrelevant or
redundant. Because feature selection reduces the dimensionality of the data,
it allows data mining algorithms to operate faster and more effectively.
The improved performance is mainly due to a more compact, more easily
interpreted representation of the target concept
[ George and Foster (2000) ] . We differentiate between three main strategies
for feature selection: filter, wrapper, and embedded [ Blum and Langley
(1997) ] .
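To make the distinction concrete, the following sketch (an assumed illustration using the scikit-learn library, not taken from the cited works) contrasts the three strategies on a synthetic dataset: a filter scores each feature independently of any induction algorithm, a wrapper searches for a feature subset by repeatedly invoking the induction algorithm itself, and an embedded method performs selection as part of training the model.

    # Sketch: filter vs. wrapper vs. embedded feature selection (scikit-learn).
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif, RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data: 50 attributes, only 5 of which are informative.
    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=5, random_state=0)

    # Filter: rank features by an ANOVA F-test, independently of any learner.
    filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
    print("filter keeps:  ", filt.get_support(indices=True))

    # Wrapper: recursive feature elimination driven by the learner itself.
    wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
    print("wrapper keeps: ", wrap.get_support(indices=True))

    # Embedded: a decision tree selects features implicitly while it is built.
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print("embedded ranks:", tree.feature_importances_.argsort()[::-1][:5])

The filter is the cheapest of the three strategies, since it never runs the induction algorithm; the wrapper is usually the most accurate for a given learner but also the most expensive, since it retrains the learner for each candidate subset.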
13.2 The “Curse of Dimensionality”
High dimensionality of the input (that is, a large number of attributes)
increases the size of the search space exponentially and thus
increases the chance that the inducer will find spurious classifiers that
are not valid in general. It is well known that the number of
labeled samples required for supervised classification increases as a function
of dimensionality [ Jimenez and Landgrebe (1998) ] . Fukunaga (1990) showed
that the required number of training samples is linearly related to the
dimensionality for a linear classifier and to the square of the dimensionality
for a quadratic classifier. For non-parametric classifiers such as decision
trees, the situation is even more severe. It has been estimated that as
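As a rough, back-of-the-envelope illustration of the scaling reported by Fukunaga (1990), one can compare the number of free parameters that a linear and a quadratic decision boundary must estimate in d dimensions; the training-set size needed to estimate them reliably grows at least in proportion to this count. The parameter counts below are an illustration assumed here, not taken from that reference.

    # Free parameters of a decision boundary in d dimensions (illustrative).
    def linear_params(d):
        # d weights plus a bias term: grows as O(d).
        return d + 1

    def quadratic_params(d):
        # d*(d+1)/2 quadratic terms, d linear terms, one bias: grows as O(d^2).
        return d * (d + 1) // 2 + d + 1

    for d in (10, 100, 1000):
        print(d, linear_params(d), quadratic_params(d))

Already at d = 100 the quadratic boundary has roughly fifty times as many parameters to estimate as the linear one (5,151 versus 101), consistent with the quadratic growth in required training samples noted above.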