in turn. More details on performance assessment and LV model diagnostic tools can
be found in [4, 21].
3.2.3.1 Scaling of Data Matrices
All projection methods described previously are scale dependent; therefore,
appropriate scaling must be applied to all measurements before analyzing
them. When no prior process knowledge is available, one common practice is to
scale all variables to unit variance, as this gives them equal importance in the
model. However, if prior knowledge exists, the scaling should be modified
accordingly. For example, if a particular set of variables is known to be roughly
twice as important as another set, the more important set could be scaled to
twice the variance of the less important set. In this chapter, mean-centering and
scaling to unit variance were applied prior to building the latent variable models
unless otherwise stated.
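The scaling step described above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than the procedure used by the authors: the function name autoscale and its optional weights argument are hypothetical, and the sample standard deviation (divisor n - 1) is assumed. A column weight w applied after unit-variance scaling gives that column a variance of w squared, so a weight of sqrt(2) corresponds to "twice the variance".

```python
from statistics import mean, stdev


def autoscale(X, weights=None):
    """Mean-center each column of X and scale it to unit variance.

    X is a list of rows (observations). `weights` is a hypothetical,
    optional list with one multiplier per column, applied after scaling;
    a column with weight w ends up with variance w**2.
    Columns with zero variance are not handled in this sketch.
    """
    n_cols = len(X[0])
    if weights is None:
        weights = [1.0] * n_cols
    cols = list(zip(*X))
    means = [mean(c) for c in cols]
    stds = [stdev(c) for c in cols]  # sample standard deviation (n - 1)
    scaled = [
        [w * (x - m) / s for x, m, s, w in zip(row, means, stds, weights)]
        for row in X
    ]
    return scaled, means, stds


# Illustrative data: the second variable is given twice the variance
# of the first by weighting it with sqrt(2).
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled, means, stds = autoscale(X, weights=[1.0, 2 ** 0.5])
```

The returned means and standard deviations would be kept so that new observations can be scaled in the same way before being projected onto the model.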
3.2.3.2 Number of Components
Another important issue in building empirical models with projection methods is
selecting, in a meaningful way, the number of components to keep in the model. The
most widely used method for selecting the number of components in projection
methods is cross-validation [16]. This method keeps adding latent variables
to the model as long as they significantly improve the predictions of the model
(PCA or PLS). Model predictive ability can be evaluated using the predictive
multiple correlation coefficient $Q^2(a) = 1 - \mathrm{PRESS}(a)/SS_r(a-1)$.
$\mathrm{PRESS}(a)$ is the total prediction error sum of squares obtained by
cross-validating a model with $a$ latent variables. This is performed by dividing
the $I$ observations included in the database ($X$ and/or $Y$, see Figure 3.2(b))
into $g$ groups of size $q$ ($I = gq$). Then, each group is deleted one at a time
and a PCA or PLS model with $a$ latent variables is built on the remaining $g - 1$
groups. The prediction error sum of squares is then computed for the group not
used to build the model. $\mathrm{PRESS}(a)$ is the total of the prediction error
sum of squares for all groups. $SS_r(a-1)$ is just the residual sum of squares of a
model with $a - 1$ latent variables. As long as $Q^2$ is greater than zero, the $a$th dimension
is improving the predictive power of the model. Therefore, one should keep adding
latent variables until $Q^2$ is consistently lower than zero. Statistical hypothesis
tests are sometimes used to verify whether the $a$th dimension has led to a
sufficient increase in $Q^2$ to be added to the model. Several other criteria have
been developed for selecting an appropriate number of latent variables. Overviews
of criteria available in the signal processing and chemometrics literature are
available in [22, 23].
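The cross-validation procedure above can be illustrated with a small Python sketch for the PLS case with a single response variable. Several things here are assumptions and not taken from the text: the NIPALS-style PLS1 fit, the helper names (dot, pls1_fit, pls1_predict, q2), the interleaved assignment of observations to the g groups, and the simplification that X and y are already mean-centered and are not re-centered within each group. The example data are made up for illustration. The code is kept dependency-free for clarity; a numerical library would normally be used.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_comp):
    """Fit a single-response PLS model; return a list of (w, p, c) per component."""
    X = [row[:] for row in X]  # work on copies; these are deflated below
    y = y[:]
    comps = []
    for _ in range(n_comp):
        cols = list(zip(*X))
        w = [dot(col, y) for col in cols]           # weights proportional to X'y
        norm = dot(w, w) ** 0.5
        if norm == 0:
            break                                   # nothing left to explain
        w = [wi / norm for wi in w]
        t = [dot(row, w) for row in X]              # scores
        tt = dot(t, t)
        if tt == 0:
            break
        p = [dot(col, t) / tt for col in cols]      # X loadings
        c = dot(y, t) / tt                          # y loading
        X = [[x - ti * pj for x, pj in zip(row, p)] for row, ti in zip(X, t)]
        y = [yi - ti * c for yi, ti in zip(y, t)]
        comps.append((w, p, c))
    return comps


def pls1_predict(comps, x):
    """Predict y for one (centered) observation x."""
    x = x[:]
    y_hat = 0.0
    for w, p, c in comps:
        t = dot(x, w)
        y_hat += t * c
        x = [xj - t * pj for xj, pj in zip(x, p)]
    return y_hat


def q2(X, y, a, g):
    """Q2(a) = 1 - PRESS(a) / SS_r(a - 1) using g cross-validation groups."""
    n = len(X)
    press = 0.0
    for k in range(g):
        held_out = set(range(k, n, g))              # interleaved group k
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        comps = pls1_fit(X_tr, y_tr, a)
        press += sum((y[i] - pls1_predict(comps, X[i])) ** 2 for i in held_out)
    # Residual sum of squares of a model with a - 1 components on all data
    prev = pls1_fit(X, y, a - 1)
    ss_r_prev = sum((yi - pls1_predict(prev, row)) ** 2 for row, yi in zip(X, y))
    return 1.0 - press / ss_r_prev


# Made-up, already centered data: y is roughly 2 * (first column).
X = [[-5.0, 1.0], [-3.0, -1.0], [-1.0, 0.0], [1.0, 0.0], [3.0, -1.0], [5.0, 1.0]]
y = [-9.9, -6.2, -1.9, 2.1, 5.8, 10.1]
for a in (1, 2):
    print(a, q2(X, y, a, g=3))
```

Under the rule stated above, components would be added while the printed value stays above zero. For a = 1 the denominator reduces to the total sum of squares of the centered response, since a model with no components predicts zero.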
Another statistic used for assessing the fit of the latent variable models is the fit
multiple correlation coefficient $R^2(a)$, or alternatively, the explained variance
for a model with $a$ latent variables, $R^2(a) = 1 - SS_r(a)/SS_{tot}$. In this
statistic, $SS_r(a)$ is the residual sum of squares of a model (PCA or PLS) with $a$
latent variables, while