The objective of data analysis often involves classification or prediction. In either case, a
basic procedure is followed. A training (or “calibration”) data set is collected compris-
ing reference measurements for the properties of interest together with the measurement
attributes believed to reflect these properties (in the case of prediction) or the categories
corresponding to the samples (in the case of classification). Chemometric techniques are then
used to identify the "best" model of the relationship between the measurement attributes
and properties of interest (prediction) or their categories (classification). Performing
a validation in which predictions are tested using a new data set (test set validation)
or using appropriate subsets of the original data matrix (cross-validation) is critical to
ensuring that the model obtained is not overfitted, that is, that it does not largely describe
random variation (Martens and Næs, 1989). If the validation is successful, the predictions are
good, and if the training set included the full range of conditions to be expected in new
samples, then the properties of interest can in future be estimated from the measurement
variables without the need to measure them explicitly (except, perhaps, occasionally for
checking purposes). Inputs to classification and prediction models can include the output
of exploratory data analysis, for example, the principal components determined by PCA
or PARAFAC.
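As an illustration of this validation step, the following minimal sketch (in Python with scikit-learn, which is not used in the text itself) fits a calibration model to a training set and then checks it both on a held-out test set and by cross-validation. The synthetic data, the choice of a PLS model, and the number of components are assumptions made purely for the example.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                              # 60 samples x 200 measured variables (e.g., wavelengths)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=60)   # property of interest (synthetic)

# Test-set validation: fit on a training set, assess predictions on held-out samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = PLSRegression(n_components=3).fit(X_train, y_train)
print("test-set R^2:", model.score(X_test, y_test))

# Cross-validation: repeatedly refit and test on subsets of the original data matrix.
cv_r2 = cross_val_score(PLSRegression(n_components=3), X, y, cv=5)
print("cross-validated R^2:", cv_r2.mean())
```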
10.9 Multivariate Calibration
Calibration aims to develop predictive models that relate properties of interest (that may
be difficult or expensive to measure) with more easily measured attributes of the chem-
ical system, for example, spectral data (Thomas, 1994; Gibb et al., 2000). Very often, a
linear relationship may be anticipated between the measured data and the variables of
interest, or else between some (possibly nonlinear) transformations thereof. Although there
are many different techniques that could be used, principal components regression (PCR)
and partial least squares regression (PLS) are obvious candidates. Both are extensions of
the multiple linear regression (MLR) model but utilize different algorithms for calculating
regression coefficients and impose different restrictions. They differ from MLR primarily
in that they are able to handle highly correlated input variables (such as adjacent wave-
lengths in fluorescence EEMs), whereas MLR requires that the input variables are, to some
extent, independent of one another. Also, PCR and PLS are able to handle the situation of having more variables
than samples.
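A brief sketch of this point, again in Python with scikit-learn on synthetic data (all names, sizes, and component counts are illustrative assumptions): PCR is built here as PCA followed by ordinary regression on the scores, PLS is fitted directly, and both are compared with plain MLR on a data set with many more, highly correlated, variables than samples.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
scores = rng.normal(size=(40, 3))                        # 3 underlying "chemical" factors
loadings = rng.normal(size=(3, 500))                     # 500 highly correlated variables
X = scores @ loadings + rng.normal(scale=0.05, size=(40, 500))
y = scores @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=40)

pcr = make_pipeline(PCA(n_components=3), LinearRegression())  # PCR: PCA scores fed to an MLR step
pls = PLSRegression(n_components=3)
mlr = LinearRegression()                                      # plain MLR, for comparison

for name, estimator in [("PCR", pcr), ("PLS", pls), ("MLR", mlr)]:
    r2 = cross_val_score(estimator, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")
```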
In PCR, the score vectors from a PCA model are used as independent variables in an
MLR model for predicting the dependent variable. Whereas in PCR a model is found that
best reflects the covariance structure between the predictor variables (the columns of matrix
X), a PLS model reflects the covariance structure between the predictor (X) and response
(Y) variables. Thus the PLS model is optimized for predicting response. Both PCR and
PLS models provide regression coefficients for predicting Y from X. The regression coeffi-
cients are different from the regression coefficients of, for example, MLR, because in PCR
and PLS, the coefficients are found as a linear combination of the loadings in the model.
They are also not directly chemically meaningful, in contrast to the loadings of a PARAFAC model. In PCR, the loadings
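The construction outlined above can be sketched as follows: PCA scores are regressed against the response, and the resulting score-space coefficients are mapped back through the loadings to give regression coefficients on the original variables, which can be compared with the coefficients reported by a PLS model. The data and component count are again illustrative assumptions, not values from the text.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
latent = rng.normal(size=(40, 4))
X = latent @ rng.normal(size=(4, 120)) + rng.normal(scale=0.05, size=(40, 120))
y = latent @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=40)

# PCR by hand: the PCA score vectors act as the independent variables of an MLR model.
pca = PCA(n_components=4).fit(X)
T = pca.transform(X)                               # score vectors (samples x components)
q = LinearRegression().fit(T, y).coef_             # regression coefficients in score space
b_pcr = pca.components_.T @ q                      # coefficients on the original (centered) variables,
                                                   # i.e. a linear combination of the PCA loadings

# PLS provides analogous coefficients, optimized for X-Y covariance rather than X variance alone.
pls = PLSRegression(n_components=4).fit(X, y)
b_pls = np.asarray(pls.coef_).ravel()

print("first few PCR coefficients:", b_pcr[:5])
print("first few PLS coefficients:", b_pls[:5])
```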