Open source software for mass spectrometry and metabolomics - Open Source Software in Life Science Research

■ Chemometrics with R, from the book Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences by R Wehrens [39]. This package contains PCA and MCR routines.

■ Chemometrics. This package is the R companion to the book Introduction to Multivariate Statistical Analysis in Chemometrics by K Varmuza and P Filzmoser (2009) [40]. It includes PCA, PLS, clustering, self-organising maps and support vector machines.

■ pls by R Wehrens and B-H Mevik [41]. Contains both PLS and PCR methods. This package is easily adapted for PLS-DA by using a categorical Y variable that denotes class membership (e.g. 0 = control, 1 = treated).

■ pcaMethods [42], initiated at the Max-Planck Institute for Molecular Plant Physiology, Golm, Germany, and now developed at the CAS-MPG Partner Institute for Computational Biology (PICB), Shanghai, P.R. China, and the RIKEN Plant Science Center, Yokohama, Japan. pcaMethods provides a number of alternative PCA methods that can handle missing data, including NIPALS, and supports cross-validation.

■ Kopls [38]. An implementation of the kernel-based orthogonal projections to latent structures (K-OPLS) method for MATLAB and R. The package includes cross-validation, kernel parameter optimisation, model diagnostics and plot tools.
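
The way NIPALS copes with missing data, mentioned for pcaMethods above, can be sketched in a few lines. This is an illustration in plain Python, not the pcaMethods implementation or its API: the function name nipals_first_component, the use of None to mark missing values, the choice of starting scores and the toy data are all assumptions of this sketch, and the data are assumed to be centred already. The idea is that each loading and each score is obtained by a small regression that simply skips the missing entries.

```python
def nipals_first_component(X, max_iter=500, tol=1e-12):
    """Scores t and unit-length loadings p of the first component.

    X is a list of rows; None marks a missing value. The data are assumed
    to be centred already (an assumption of this sketch).
    """
    n, m = len(X), len(X[0])
    # Starting scores: first column, with missing entries set to zero.
    t = [row[0] if row[0] is not None else 0.0 for row in X]
    for _ in range(max_iter):
        # Loadings: regress each column on t, using observed rows only.
        p = []
        for j in range(m):
            rows = [i for i in range(n) if X[i][j] is not None]
            p.append(sum(X[i][j] * t[i] for i in rows)
                     / sum(t[i] ** 2 for i in rows))
        norm = sum(v * v for v in p) ** 0.5
        p = [v / norm for v in p]
        # Scores: regress each row on p, using observed columns only.
        t_new = []
        for i in range(n):
            cols = [j for j in range(m) if X[i][j] is not None]
            t_new.append(sum(X[i][j] * p[j] for j in cols)
                         / sum(p[j] ** 2 for j in cols))
        change = sum((u - v) ** 2 for u, v in zip(t_new, t))
        t = t_new
        if change <= tol * sum(v * v for v in t):
            break
    return t, p


# Toy rank-one data (hypothetical values) with one entry removed.
a, b = [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 3.0]
X = [[ai * bj for bj in b] for ai in a]
X[0][0] = None
t, p = nipals_first_component(X)
estimate = t[0] * p[0]  # model-based estimate of the missing entry
print(estimate)         # should be close to a[0] * b[0] = 2.0
```

Because the toy matrix is exactly rank one, a single component recovers the removed value; with real data the estimate would only be as good as the model.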

4.7.1 Important considerations with multivariate analysis

The most critical aspect of multivariate analysis is the ability to estimate the predictive power, or stability, of a model. This is usually done by cross-validation [45], in which some of the data are sequentially left out and the model is recalculated. The left-out data are then predicted from the model, and the differences between the predicted and actual values are summarised in a parameter called Q2, the predictive variance. Without an estimate of predictivity, there is no objective way to choose the optimum number of components, or even to tell whether any components are predictive at all. The variance explained, or R2, of a model keeps increasing with every added component, so there is a great danger of overfitting if this is the only criterion used to judge the model.
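
The contrast between R2 and Q2 can be illustrated with a small, self-contained sketch. It is written in plain Python rather than with any of the R packages listed earlier, and the function names (pls1_fit, pls1_predict, r2_and_q2), the leave-one-out scheme, the NIPALS-style single-response PLS and the simulated data are all assumptions made for illustration. Q2 is taken here as 1 - PRESS/SS, where PRESS is the sum of squared prediction errors for the left-out samples and SS is the total sum of squares of the centred response; other definitions are in use.

```python
import random


def pls1_fit(X, y, n_comp):
    """Single-response PLS by NIPALS.

    Returns (x_mean, y_mean, [(w, p, q), ...]) with one tuple per component.
    """
    n, m = len(X), len(X[0])
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[v - mu for v, mu in zip(row, x_mean)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to model
            break
        w = [v / norm for v in w]
        t = [sum(e * wj for e, wj in zip(row, w)) for row in E]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(fi * ti for fi, ti in zip(f, t)) / tt
        # Deflate X and y before extracting the next component.
        E = [[e - ti * pj for e, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x, n_comp):
    """Predict y for one sample x using the first n_comp components."""
    x_mean, y_mean, comps = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps[:n_comp]:
        t = sum(ei * wj for ei, wj in zip(e, w))
        y_hat += q * t
        e = [ei - t * pj for ei, pj in zip(e, p)]
    return y_hat


def r2_and_q2(X, y, max_comp):
    """R2 (fit) and leave-one-out Q2 (prediction) for 1..max_comp components."""
    n = len(y)
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    full = pls1_fit(X, y, max_comp)
    # One model per left-out sample, fitted without that sample.
    loo = [pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], max_comp)
           for i in range(n)]
    r2, q2 = [], []
    for a in range(1, max_comp + 1):
        rss = sum((yi - pls1_predict(full, xi, a)) ** 2
                  for xi, yi in zip(X, y))
        press = sum((yi - pls1_predict(mod, xi, a)) ** 2
                    for mod, xi, yi in zip(loo, X, y))
        r2.append(1 - rss / ss_tot)
        q2.append(1 - press / ss_tot)
    return r2, q2


# Simulated data (hypothetical): 20 samples, 30 pure-noise variables,
# and an arbitrary 0/1 class label that has no relation to X.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(20)]
y = [0.0] * 10 + [1.0] * 10
r2, q2 = r2_and_q2(X, y, max_comp=4)
for a, (r, q) in enumerate(zip(r2, q2), start=1):
    print(f"components={a}  R2={r:.2f}  Q2={q:.2f}")
```

With pure-noise data such as this, R2 can only go up as components are added, whereas Q2 would be expected to stay low or negative, which is the overfitting pattern that R2 alone cannot reveal.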

The ability to estimate predictivity becomes of paramount importance when using supervised methods such as PLS-DA. Without the Q2 measure it is possible to obtain discriminant models that are effectively worthless, for example apparent separations obtained from random data [45]. In addition to cross-validation, permutation testing is also a highly effective
