available, based on indicators of clustering or predictive ability, respectively. This selection can be performed by considering each variable independently or with respect to the others. An individual evaluation is, however, unable to account for interactions between variables, and synergistic effects may be missed. Two main approaches are available: the filter and the wrapper methods.60

The former performs the selection of variables by assessing a quality index for individual variables or groups of variables prior to the modeling step. Simple criteria such as Student's t-test and analysis of variance, or more sophisticated strategies, including recursive feature elimination61 or information gain,62 can be applied. Some strategies based on variable ranking require the number of selected variables to be set manually.

On the other hand, wrapper approaches involve appropriate data modeling. Once a model is built, the individual contributions of the variables can be used as a quality index for further selection, such as regression coefficients for variables' weights. The selection is therefore integrated into the modeling procedure, and such an approach may be computer-intensive in the case of very large data sets. Additionally, the resulting subsets of variables depend heavily on the choice of the modeling algorithm and its parameters.63

When using a data compression method, such as PLS regression (as discussed later in this chapter), models based on latent variables are obtained. In that context, the variable importance in projection (VIP) constitutes an attractive criterion for variable selection.64 A VIP value close to or greater than 1 indicates an important variable in the model. Conversely, values close to zero indicate irrelevant variables that can be excluded. Moreover, other tools such as the selectivity ratio plot and the discriminating variable (DIVA) test were developed for the selection of the most discriminating variables in spectral or chromatographic profiles.65

Data Redundancy/Correlation Suppression

By aggravating the curse of dimensionality, data redundancy and correlations are often undesirable and detrimental to modeling. This issue can be addressed by selecting one of the correlated variables within a group or by building new variables from the original correlated variables (variable construction/transformation). Computing the correlation between sets of variables constitutes a simple way to evaluate their degree of redundancy. The Pearson correlation between the variables can be computed and the correlation profiles analyzed to assess the strength of the linear relationship. A high positive correlation would suggest a similar regulatory control; a correlation close to zero would suggest no relationship; and a strong negative correlation would imply negative feedback loops.66 Additionally, a corresponding statistical value can be computed to determine whether an observed correlation is likely to be attributed to a biological phenomenon or to a random process. Nevertheless, a value close to zero does not imply an absence of relationship, only an absence of linear relationship. As it relies on the ranks of the two series rather than their numerical values, Spearman's rank correlation coefficient constitutes another simple solution to cope with the linearity assumption. On the other hand, correlation does not imply causation, regardless of the way in which it is evaluated.

As an example, the correlation-based feature selection algorithm67 aims at the parsimonious selection of nonredundant subsets of predictive variables. The predictive merit and the degree of correlation are balanced to promote a high correlation with the class information and a low level of intercorrelation within the subset. It can therefore drastically reduce the number of variables in highly intercorrelated metabolomic data without loss of prediction performance. Reducing redundancy should be
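As a minimal sketch of such a filter criterion (toy data and function names invented for illustration, not taken from this chapter), variables can be ranked by the absolute value of a t-statistic between two classes, keeping only the top-scoring ones:

```python
import math

def t_statistic(group_a, group_b):
    """Welch's t-statistic for one variable measured in two classes."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def filter_select(X_a, X_b, k):
    """Filter-style selection: rank variables (columns) by |t| and
    keep the k highest-scoring ones, prior to any modeling step.

    X_a, X_b: samples of each class as lists of rows.
    Returns the indices of the k selected variables.
    """
    n_vars = len(X_a[0])
    scores = []
    for j in range(n_vars):
        t = t_statistic([row[j] for row in X_a], [row[j] for row in X_b])
        scores.append((abs(t), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy example: variable 0 separates the classes, variable 1 is noise.
class_a = [[1.0, 0.3], [1.2, 0.5], [0.9, 0.1], [1.1, 0.4]]
class_b = [[3.0, 0.2], [3.2, 0.6], [2.9, 0.3], [3.1, 0.2]]
print(filter_select(class_a, class_b, 1))  # variable 0 is selected
```

Note that k, the number of retained variables, must be chosen by the analyst, which is precisely the manual setting mentioned above.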
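A wrapper strategy of this kind can be sketched as backward elimination (an illustrative toy, with simulated data; the chapter does not prescribe this particular procedure): a linear model is refitted repeatedly, and at each pass the variable with the smallest absolute coefficient is discarded, so the selection is driven by the model itself:

```python
import numpy as np

def backward_eliminate(X, y, n_keep):
    """Wrapper-style selection: refit a least-squares model and drop
    the variable with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        coefs, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coefs)))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
# Only variables 1 and 3 actually drive the response.
y = 2.0 * X[:, 1] - 1.5 * X[:, 3] + 0.01 * rng.standard_normal(50)
print(backward_eliminate(X, y, 2))  # the two informative variables survive
```

Because the model is refitted at every pass, the cost grows quickly with the number of variables, which is the computational burden noted above; swapping the linear model for another learner can also change which subset survives.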
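To make the linearity caveat concrete, the following sketch (values invented for illustration) compares the two coefficients on a series that is perfectly monotonic but strongly nonlinear: Spearman's rank correlation reports a perfect association, while Pearson's coefficient understates it.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the *linear* relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks of each
    series (assumes no ties, for simplicity of the sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 4 for v in x]          # monotonic but strongly nonlinear
print(round(pearson(x, y), 3))   # below 1: the relationship is not linear
print(round(spearman(x, y), 3))  # 1.0: the ranks agree perfectly
```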