Geoscience Reference
In-Depth Information
come from PCA so they are orthogonal abstract representations of the most important spec-
tral variation. In PLS, the loadings are similarly abstract, but in this case they reflect the
variation in X relevant for predicting the dependent variables. The regression coefficients
can be interpreted, though, in much the same way as MLR regression coefficients.
A multilinear version of PLS, called N-PLS, has also been developed for building
regression models for multiway data (Bro, 1996). The loadings of an N-PLS model of
EEMs are analogous to PARAFAC loadings, that is, a single excitation spectrum, a single
emission spectrum, and a vector of regression coefficients for each latent variable in the
model. Unlike with PARAFAC, however, the latent variables identified by N-PLS are opti-
mized for predicting the response matrix and would not normally represent pure chemical
spectra. Hence, the N-PLS loadings are similar in that respect to ordinary two-way PLS
loadings.
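As a rough illustration of the structure just described, the trilinear N-PLS model for a single response can be written as below. The symbols are assumptions introduced here, not notation from the text: x_ijk is the fluorescence of sample i at excitation wavelength j and emission wavelength k, t_if is the score of sample i on latent variable f, w^J and w^K are the excitation and emission weight vectors, b_f are the regression coefficients, and e and r are residuals.

```latex
% Sketch of a trilinear N-PLS model with F latent variables (notation assumed)
\begin{align}
  x_{ijk} &= \sum_{f=1}^{F} t_{if}\, w^{J}_{jf}\, w^{K}_{kf} + e_{ijk}, \\
  y_{i}   &= \sum_{f=1}^{F} t_{if}\, b_{f} + r_{i}.
\end{align}
```

The decomposition of X has the same form as a PARAFAC model; the difference is that the weights are chosen so that the scores covary with y rather than to give the best least-squares fit to X alone, which is why they need not resemble pure chemical spectra.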
For the Horsens catchment data set, the correlation between fluorescence and DOC
implied by the principal component analysis suggests that it may be possible to predict
DOC from fluorescence intensity. Previously, Vasel and Praet (2002) attempted to predict
DOC and TOC in a wastewater treatment plant from fluorescence emission scans
(300-450 nm) obtained at an excitation wavelength of 280 nm, finding only poor correlations
(R² < 0.4) and low predictive success. Much greater success in predicting DOC
from fluorescence (R² > 0.9) was documented by Marhaba et al. (2003), using a
three-component PCR model developed from 69 EEMs (excitation 225-500 nm and emission
231-633 nm) collected from a canal supplying various New Jersey water treatment
utilities.
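The following paragraph applies PLS regression to unfolded, mean-centered EEMs and compares models by cross-validation. As a minimal sketch of that kind of workflow, the Python below uses synthetic two-component EEMs, not the Horsens data. Everything in it is an assumption made for illustration: the function names `pls1_fit` and `rmsecv` are hypothetical, the standard NIPALS PLS1 algorithm stands in for whatever software the authors used, and 5-fold cross-validation is chosen arbitrarily because the text does not state the scheme.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit single-response PLS by NIPALS; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # mean-centering only, no scaling
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr                        # weight: direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                           # scores
        tt = t @ t
        p = Xr.T @ t / tt                    # X loading
        qa = (yr @ t) / tt                   # y loading
        Xr = Xr - np.outer(t, p)             # deflate X and y
        yr = yr - qa * t
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in X space
    return x_mean, y_mean, coef

def rmsecv(X, y, n_lv, n_folds=5, seed=0):
    """K-fold cross-validation; returns (RMSECV, cross-validated R^2)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_cv = np.empty_like(y)
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        xm, ym, b = pls1_fit(X[train], y[train], n_lv)
        y_cv[test] = (X[test] - xm) @ b + ym # predict held-out samples
    press = np.sum((y - y_cv) ** 2)
    return np.sqrt(press / len(y)), 1 - press / np.sum((y - y.mean()) ** 2)

# Synthetic EEMs: 40 samples, 20 excitation x 30 emission wavelengths (assumed sizes)
rng = np.random.default_rng(1)
n, n_ex, n_em = 40, 20, 30
ex, em = np.linspace(0, 1, n_ex), np.linspace(0, 1, n_em)
gauss = lambda x, mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2)
ex_spec = np.stack([gauss(ex, 0.3, 0.10), gauss(ex, 0.6, 0.15)])
em_spec = np.stack([gauss(em, 0.4, 0.10), gauss(em, 0.7, 0.12)])
conc = rng.uniform(0.5, 5.0, size=(n, 2))    # two hypothetical fluorophores
eems = np.einsum('if,fj,fk->ijk', conc, ex_spec, em_spec)
eems += 0.02 * rng.standard_normal(eems.shape)
doc = conc @ np.array([1.5, 0.8]) + 0.1 * rng.standard_normal(n)

X = eems.reshape(n, -1)                      # unfold: 600 variables, 40 samples
results = {a: rmsecv(X, doc, a) for a in range(1, 6)}
for a, (rmse, r2) in results.items():
    print(f"{a} LVs: RMSECV = {rmse:.3f}, R2cv = {r2:.3f}")
```

Printing RMSECV and cross-validated R² against the number of latent variables gives the kind of summary used to choose model complexity. Note that the unfolded matrix has far more variables than samples, which is the situation in which cross-validated statistics deserve the most scrutiny.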
For the current example, the unfolded EEMs are used to predict DOC concentrations
in the Horsens catchment using PLS regression. Preprocessing is by mean-centering only
because, as described earlier, normalization would remove concentration-related information
inherent in the fluorescence data that is relevant to predicting DOC concentrations.
Cross-validation indicates that a model with two latent variables has a relatively
high correlation coefficient (cross-validated R²cv = 0.91); however, inspection of the model
predictions (Figure 10.10) indicates that DOC concentrations at the estuary sites are poorly
predicted in a combined-site model owing to their low concentrations and low influence on the model. A
solution in this case is to construct two separate models: one for the river and WTP sites,
and another for the estuary sites. Table 10.2 summarizes the PLS results relevant to selecting
the number of latent variables (LVs) in each model. When choosing the optimal model,
there is a trade-off between keeping the number of LVs small, because each additional
latent variable increases the risk of overfitting and hence of poorer predictions with future data,
and minimizing the cross-validated root-mean-square error of prediction (RMSECV). Models
with lower RMSECV also have higher correlation coefficients, R²cv. Although models are
often selected to have minimum RMSECV, it is not always the case that the model with
lowest RMSECV is the best. It has been observed that PLS is “eager-to-please,” meaning
that it can easily produce over-optimistic assessments of predictive capability, especially
in situations where there are many more variables than samples (e.g., in the case of many
data sets of unfolded EEMs) or when cross-validation is inadequate (e.g., leave-one-out