Although chemometric analysis is often implemented on unscaled fluorescence EEMs
without prior preprocessing (Gurden et al., 2001; Bro and Smilde, 2003), this approach
may be unsuitable for data sets encompassing large concentration gradients. Samples with
higher concentrations may exert disproportionately high leverage on the model because
least-squares fitting minimizes the total squared residual, which is dominated by the
most concentrated samples. As a general rule, when data sets encompass large concentration ranges
(i.e., varying orders of magnitude) it can be helpful to normalize the area of each EEM to
ensure that the modeling is focused on the chemical variations rather than on the magnitude
of total signals. This is done by scaling the data in the first (sample) mode to unit norm, that
is, dividing each sample by the square root of the sum of the squared values of all its variables. Normalization
(and other operations affecting rows) should be performed before column operations (such
as scaling and mean centering). Importantly, this normalization of each sample can be
reversed after the model has been estimated. That is, scores in the original scale can be
obtained by multiplying each sample's scores by the factor by which that sample was divided.
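
The normalization and its reversal can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: it assumes that "unit norm" means the Euclidean norm of each unfolded EEM, the data are synthetic and noise-free, and a least-squares projection onto known component spectra stands in for a fitted decomposition model. All names (`eems`, `shapes`, `normalize_samples`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 5 samples, each an EEM of 4 emission x 3 excitation
# wavelengths, built from two component spectra mixed over a wide concentration range.
n_samples, n_em, n_ex = 5, 4, 3
shapes = rng.random((2, n_em * n_ex))          # unfolded component spectra
conc = np.array([[1.0, 0.5],
                 [10.0, 2.0],
                 [100.0, 80.0],
                 [0.1, 0.3],
                 [1000.0, 50.0]])
eems = (conc @ shapes).reshape(n_samples, n_em, n_ex)


def normalize_samples(x):
    """Scale each sample (first mode) of a 3-way array to unit Euclidean norm."""
    flat = x.reshape(x.shape[0], -1)
    norms = np.sqrt((flat ** 2).sum(axis=1))   # one scaling factor per sample
    return x / norms[:, None, None], norms


eems_normed, norms = normalize_samples(eems)

# Stand-in for model fitting: with the component spectra treated as known,
# the scores are the least-squares coefficients of each normalized sample.
flat_normed = eems_normed.reshape(n_samples, -1)
coeffs, *_ = np.linalg.lstsq(shapes.T, flat_normed.T, rcond=None)
scores_normed = coeffs.T                       # shape (n_samples, n_components)

# Reverse the normalization: multiply each sample's scores by its own factor.
scores_original = scores_normed * norms[:, None]
```

In this noise-free sketch `scores_original` recovers the concentrations used to build the data, whereas `scores_normed` reflects only the relative composition of each sample.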
Preprocessing of multivariate and multiway data sets prior to regression and discriminant
analysis follows the general principles outlined earlier, with a few exceptions. In general,
the response matrix (i.e., the data that are to be predicted) should be mean centered because
this serves an additional purpose in regression and classification models: centering both
the dependent and independent variables removes any differences in offsets between them.
Row normalization can be implemented if the priority is to establish a relationship between
variables, rather than estimate the magnitude of the response, or to stabilize the impact of
differently concentrated samples on models, as previously described. For example, if the
calibration model is intended to predict a concentration from data that follow the
Beer-Lambert law (e.g., fluorescence), then it is crucial not to normalize as this would cause the
loss of concentration information. If, on the other hand, the model is intended to classify
samples, then normalization may help the model focus on patterns rather than on
concentration-induced variations.
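
Both points in this paragraph can be illustrated with a small sketch. It assumes an idealized, noise-free signal that is strictly proportional to concentration plus a constant per-variable baseline, and it uses ordinary least squares in place of whatever calibration method is actually applied. The spectrum, baseline, and concentrations are invented for illustration.

```python
import numpy as np

# Synthetic calibration set (assumption): signal = concentration * spectrum + baseline.
spectrum = np.array([0.2, 1.0, 0.6, 0.1])
baseline = np.array([0.5, 0.2, 0.4, 0.3])      # constant offset in each variable
conc = np.array([1.0, 2.0, 5.0, 10.0, 20.0])   # response to be predicted
X = np.outer(conc, spectrum) + baseline
y = conc

# Mean centering both blocks removes the offsets, so no intercept term is needed.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean
b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)    # minimum-norm least-squares solution

# Predict a new sample by applying the same centering, then adding the mean back.
conc_new = 8.0
x_new = conc_new * spectrum + baseline
y_pred = (x_new - x_mean) @ b + y_mean

# Row normalization of purely proportional data removes concentration information:
# every sample collapses onto the same unit-norm spectrum.
P = np.outer(conc, spectrum)
P_normed = P / np.linalg.norm(P, axis=1, keepdims=True)
rows_identical = np.allclose(P_normed, P_normed[0])
```

Here `y_pred` reproduces the new concentration despite the baseline, while `rows_identical` is true, showing that a calibration model fitted to `P_normed` would have nothing left from which to predict concentration.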
10.4 Exploratory Data Analysis
Most chemometric treatments of DOM fluorescence data to date have been directed
towards identifying patterns within data sets (unsupervised pattern recognition or cluster
analysis), or deducing the underlying structure of individual EEMs (spectral decomposition)
(Table 10.1). These are exploratory techniques in the sense that they are geared toward
identifying structures within data sets in order to generate hypotheses about what variables
may be important for various purposes (e.g., for classification or prediction), but do not
involve hypothesis testing per se. Exploratory data analysis includes methods for analyzing
both multivariate and multiway data sets. Cluster analysis is used to sort similar samples
into categories, such that two samples with similar measured variables belong to the same
group and two samples that have very different measurements belong to different groups.
Similarity is measured on the basis of some algorithmically determined distance. Thus