have been developed. Their use is more or less user-friendly, depending on the implementation of a graphical or a command-line user interface. Free bioinformatics tools include MZmine, 46 XCMS, 27 MetAlign, 47 MSFACTs, 48 TagFinder, 49 MET-IDEA, 50 MathDAMP, 51 msInspect, 52 OpenMS 53 and MetaboliteDetector. 54 Commercial solutions are often vendor-specific, such as MarkerLynx, Mass Profiler, or Sieve based on the ChromAlign algorithm. 55 Relevant reviews concerning this particular point are regularly updated.

VARIABLE SELECTION

The number of variables is a pivotal aspect of data modeling because it defines the size of the hypothesis space. 56 When dealing with multivariate data of high dimensionality, biologically relevant hypotheses are harder to find. Variable selection aims at reducing data dimensionality by selecting subsets of pertinent variables for robust and parsimonious models. It is therefore closely related to the detection and selection of relevant biomarkers. Additionally, reducing the number of variables speeds up model computation and improves its stability. Moreover, the prediction performance of most supervised models can be improved by a prior variable selection, as their accuracy is often decreased by highly correlated or irrelevant variables. A pertinent data subset should therefore include representative variables, able to retain the salient characteristics of the data to provide a more compact and interpretable image of the phenomenon under study. 57

The number of variables needed for data modeling depends directly on the data mining algorithm. Although decision trees may require only a handful of relevant biomarkers for classification, other strategies such as partial least squares (PLS) or nearest neighbor classification make use of all variables. 58 So-called sparse classifiers, that is, models based on a limited number of variables, are less prone to overfitting and ensure easier model interpretation. 59

Variable selection usually involves a two-step procedure: the generation of variable subsets and the estimation of their respective predictive or clustering ability. Single variables or groups of potentially interacting biomarkers can be assessed. Three major approaches can be envisaged to reduce data dimensionality by selecting individual or groups of biomarkers: predictive/clustering ability, redundancy removal, and biological knowledge integration. These aspects can be combined to obtain highly predictive subsets of compounds of biological interest. An overview of the variable selection process is provided in Figure 3.

Selection by Modeling/Prediction

Variable selection based on modeling/prediction requires some quality index to detect the most appropriate combination of variables. Unsupervised and supervised methods are
FIGURE 3 Variable selection procedure.
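
The two-step procedure described in the text (generating variable subsets, then estimating their predictive ability), preceded by a redundancy-removal step, can be sketched in Python as follows. This is only an illustrative sketch and not the method of any tool cited above: the nearest-centroid classifier, the leave-one-out accuracy used as quality index, the 0.95 correlation threshold, the exhaustive search over small subsets, the synthetic data, and all function names are assumptions introduced here.

```python
import itertools
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def drop_redundant(X, threshold=0.95):
    """Redundancy removal: keep a variable only if its absolute correlation
    with every variable already kept is below the threshold."""
    kept = []
    for j in range(len(X[0])):
        col_j = [row[j] for row in X]
        if all(abs(pearson(col_j, [row[k] for row in X])) < threshold
               for k in kept):
            kept.append(j)
    return kept


def loo_accuracy(X, y, subset):
    """Quality index: leave-one-out accuracy of a nearest-centroid
    classifier restricted to the variables in `subset`."""
    hits = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[n] for n in range(len(X)) if n != i and y[n] == label]
            centroids[label] = [statistics.fmean(r[j] for r in rows)
                                for j in subset]
        sample = [X[i][j] for j in subset]
        predicted = min(
            centroids,
            key=lambda c: sum((s - m) ** 2
                              for s, m in zip(sample, centroids[c])),
        )
        hits += predicted == y[i]
    return hits / len(X)


def select_variables(X, y, max_size=2, threshold=0.95):
    """Step 1: generate candidate subsets; step 2: score each one.
    Ties are resolved in favour of the first (smallest) subset found."""
    candidates = drop_redundant(X, threshold)
    best_subset, best_score = None, -1.0
    for size in range(1, max_size + 1):
        for subset in itertools.combinations(candidates, size):
            score = loo_accuracy(X, y, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Synthetic data (hypothetical): variable 0 separates the two classes,
# variable 1 is a linear copy of variable 0, variables 2 and 3 are noise.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = []
for label in y:
    v0 = 10 * label + rng.random()
    X.append([v0, 2 * v0 + 1, rng.random(), rng.random()])

kept = drop_redundant(X)
best_subset, best_score = select_variables(X, y)
print("kept after redundancy removal:", kept)
print("best subset:", best_subset, "LOO accuracy:", best_score)
```

An exhaustive search is only practical for a handful of candidate variables; with the thousands of features typical of metabolomics data, a greedy or filter-based subset generator would replace the `itertools.combinations` loop.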