Information on preprocessing in this chapter is necessarily brief; for broader discussion
readers are referred to more comprehensive accounts (Thomas, 1994; Naes et al., 2002;
Bro and Smilde, 2003). In general, preprocessing should be based on specific aims such as
removing the Rayleigh scattering or giving minor peaks a chance to enter the model. When
such practical concerns are guiding the preprocessing, it is often simpler to choose the
appropriate tools. An illustration of the role of preprocessing on the principal component
analysis (PCA) of fluorescence EEMs is provided later in this chapter.
Wavelength selection is sometimes overlooked in the preprocessing of spectral data.
It should be borne in mind that although an instrument may be capable of collecting data
across a wide range of excitation and emission wavelengths, such data may be of variable
quality and importance. In particular, depending on the type and condition of the spectropho-
tometer light source, and the characteristics of the sample, data obtained at low excitation
wavelengths can have very high uncertainties owing to a combination of factors, including
decreasing lamp output, decreasing transmission efficiency of the excitation monochromator,
and increasing light reabsorption by the sample (inner filter effects) (Lakowicz, 2006).
For this reason, it is often advisable to exclude fluorescence data obtained at low excitation
wavelengths. Alternatively, it should be verified that including such data does not skew
the results of chemometric analysis. If combining fluorescence data sets collected using
more than one fluorometer, intercalibration before analysis is also necessary to ameliorate
the effects of instrument biases (Cory et al., 2010; Murphy et al., 2010). Data affected by
phenomena unrelated to fluorescence (e.g., Rayleigh and Raman scatter) should always be
removed prior to modeling (Andersen and Bro, 2003).
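The wavelength-selection and scatter-removal steps described above can be sketched with NumPy. The wavelength grids, the 250 nm excitation cutoff, and the scatter-window widths below are illustrative assumptions, not values given in the text:

```python
import numpy as np

# Hypothetical EEM: rows = emission wavelengths, columns = excitation
# wavelengths (grids and data are invented for illustration).
em = np.arange(300.0, 601.0, 2.0)   # emission grid, nm
ex = np.arange(240.0, 451.0, 5.0)   # excitation grid, nm
eem = np.random.default_rng(0).random((em.size, ex.size))

# Exclude low excitation wavelengths, where lamp output and monochromator
# throughput make the data unreliable (the 250 nm cutoff is an assumption).
keep = ex >= 250.0
eem, ex = eem[:, keep], ex[keep]

# Mask first-order Rayleigh scatter (emission ~ excitation) and the water
# Raman band (~3400 cm^-1 shift) by setting affected pixels to NaN.
EM, EX = np.meshgrid(em, ex, indexing="ij")
rayleigh = np.abs(EM - EX) < 15.0           # assumed +/-15 nm window
raman_em = 1e7 / (1e7 / EX - 3400.0)        # Raman peak position, nm
raman = np.abs(EM - raman_em) < 10.0        # assumed +/-10 nm window
eem[rayleigh | raman] = np.nan
```

Marking scatter-affected pixels as missing (NaN) rather than zero lets downstream models such as PARAFAC treat them as unknown rather than as genuine low intensity.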
In multivariate matrices in which the variables are not “smoothly” related as they
are in spectral data sets, the preprocessing needs to ensure that a variable is not
essentially disregarded just because it has a small scale. For spectral data, a small value
usually implies little information, but for discrete data this is not necessarily
the case. For example, one variable may be a temperature that varies over only a few
degrees; this is not necessarily less important than another variable measured
in weight that varies over thousands of milligrams. To account for cases where the scales of
different variables are not proportional to their importance, analyses on “non-smooth”
data are typically performed after first mean centering (subtracting the column average
from each column) and scaling (dividing each column by its standard deviation). These
steps are often referred to in combination as “auto-scaling” (Thomas, 1994). The centering
removes the common features, so that the PCA model focuses on differences between
the samples. The scaling gives each variable equal weighting in the model, rather than
giving greater weighting to variables that naturally exhibit a larger absolute range. In
the case of spectra, differences between wavelengths in intensity ranges are chemically
meaningful and the auto-scaling of variables (and similar operations operating on the
columns of the data set) can distort genuine proportional relationships between wave-
lengths, especially in data sets that span large concentration ranges. Thus for spectral
data, often no preprocessing is needed, although mean centering may be convenient for
visualization (Bro and Smilde, 2003).
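As a minimal sketch of the auto-scaling described above (the small data matrix is invented for illustration), mean centering and unit-variance scaling are two column-wise NumPy operations:

```python
import numpy as np

# Hypothetical data matrix: rows = samples, columns = discrete variables
# on very different scales (temperature in degrees C, mass in mg).
X = np.array([[20.1, 1500.0],
              [21.3, 3200.0],
              [19.8, 2400.0],
              [22.0, 4100.0]])

# Auto-scaling = mean centering + unit-variance scaling, column by column.
X_centered = X - X.mean(axis=0)              # remove the common offset
X_auto = X_centered / X.std(axis=0, ddof=1)  # equalize variable weighting
```

After this transformation every column has mean zero and unit standard deviation, so the few-degree temperature range and the multi-thousand-milligram mass range contribute equally to a subsequent PCA.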