examples, and then two illustrative studies from
the recent literature are described.
Multivariate analysis (MVA) is the statistical
analysis of many variables at once. Many problems
in the life sciences are multivariate in nature,
and the analysis of large multivariate data sets
is a major challenge for life science research.
MVA has been made much easier by the development
of inexpensive, fast computers and powerful
analytical software. Chemometrics, the statistical
analysis of chemical data, is an important area of
MVA. Spectral data from modern instruments used in
metabolomics, such as NMR and mass spectrometry,
are fundamentally multivariate in character.
Furthermore, the powerful statistical methods of
chemometrics are essential for the
identifi-
where $\bar{x}$ is the average value of the $n$
measurements. The denominator in Eq. (1) is $n - 1$
because, once the average is calculated, there
are $n - 1$ degrees of freedom. We note that the
standard deviation has the same units as the
variable of interest.
A plot of the scaled data is shown in Figure 1.
The x-axis is the scaled length and the y-axis is
the scaled mass. Each point represents one of the
43 samples. In Figure 1, two data points lie far
away from all of the others. Statisticians call
data points that do not belong to the data set
outliers. Outliers are important to identify and
remove from the analysis because a single outlier
can greatly influence the statistical analysis and
obscure underlying trends in the data. We note
that caution must be used in removing outliers
because an outlier sample may also be a critical
observation. An understanding of the underlying
biology is often essential in the identifi-
cation of biomarkers and subgroups
within a given sample population. In this
chapter, we review chemometrics and MVA and
their application to the analysis of
proteomics and metabolomics data.
With metabolomics or proteomics data, it is
not uncommon to measure several thousand
variables at one time. However, it is often hard
to conceptualize so many variables; therefore,
we begin our discussion of MVA with a few
simple examples that illustrate important
statistical concepts essential in MVA. The
cation and interpretation of outliers.
The scaled data are replotted in Figure 2 with
the outlier points removed. The reader will also
note that the origin of the graph has been moved
to the center point of the data set. This operation,
in which the average of the overall data set is
subtracted from the data, is called mean centering.
As mentioned earlier, in MVA we are concerned with
investigating the variation within the data set;
the average values of the data set are not of
primary importance. Two arrows in the
first problem is a set of mass and length
measurements for an animal species. The data
are shown in Table 1. Inspection of the data
reveals that the length values are near 1.0 and
the mass values are closer to 100. A goal of the
data analysis is to understand the variation
within the data set. It is advantageous for the
two variables in the data set to have similar
magnitudes; therefore, we scale each of the two
variables by its own standard deviation.
The standard deviation,
figure illustrate the two directions of variation
within the data set. P1 is the direction of largest
variation and P2 is the direction of second-largest
variation. It is important to note that P1 and P2
are perpendicular to each other. In MVA, P1 and
P2 are the first and second principal components
of the data set, respectively.
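The mean centering step and the two perpendicular directions of variation can be sketched in a few lines of Python. This is only an illustrative sketch for the two-variable case: the data points are hypothetical (they are not the values of Table 1), the function names are our own, and the closed-form angle formula for a 2 x 2 covariance matrix is one convenient way to find P1 and P2, not necessarily the procedure used to draw Figure 2.

```python
import math

def mean_center(points):
    """Subtract the average of each variable so the origin sits at the data center."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return [(x - mean_x, y - mean_y) for x, y in points]

def principal_directions(centered):
    """Return unit vectors (P1, P2) for mean-centered two-variable data.

    Uses the closed-form orientation of the major axis of a 2 x 2
    covariance matrix: tan(2 * theta) = 2 * s_xy / (s_xx - s_yy).
    """
    n = len(centered)
    s_xx = sum(x * x for x, _ in centered) / (n - 1)
    s_yy = sum(y * y for _, y in centered) / (n - 1)
    s_xy = sum(x * y for x, y in centered) / (n - 1)
    theta = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    p1 = (math.cos(theta), math.sin(theta))    # direction of largest variation
    p2 = (-math.sin(theta), math.cos(theta))   # perpendicular to P1
    return p1, p2

def variance_along(centered, direction):
    """Sample variance of the data projected onto a unit direction."""
    n = len(centered)
    dx, dy = direction
    return sum((x * dx + y * dy) ** 2 for x, y in centered) / (n - 1)

# Hypothetical scaled (length, mass) pairs, for illustration only.
points = [(5.7, 5.6), (6.3, 6.2), (7.0, 6.6), (7.6, 7.4), (5.1, 5.1)]

centered = mean_center(points)
p1, p2 = principal_directions(centered)
dot_product = p1[0] * p2[0] + p1[1] * p2[1]   # zero when P1 and P2 are perpendicular
var_p1 = variance_along(centered, p1)
var_p2 = variance_along(centered, p2)
```

In this sketch, `dot_product` should vanish and `var_p1` should be at least as large as `var_p2`, which mirrors the statement that P1 carries the largest variation and P2 the second largest. For data sets with more than two variables, a general eigenvalue or singular value decomposition routine would be needed instead of the angle formula.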
For each data point, the projection of the
point onto the P1 or P2 vector is called
a score value. Plots of score values for different
principal components, typically P1 versus P2,
are called score plots. Score plots provide
important information about how different samples
$s$, of a set of measurements
$(x_1, \ldots, x_n)$ is given by

$$ s = \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \right)^{1/2} \qquad (1) $$
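As a rough illustration of Eq. (1) and of scaling each variable by its own standard deviation, the following Python sketch may help. The length and mass values are hypothetical and are not the data of Table 1; the helper names `sample_std` and `scale_by_std` are our own. The standard library function `statistics.stdev`, which also uses the n - 1 denominator, is included only as a cross-check.

```python
import math
import statistics

def sample_std(values):
    """Standard deviation with the n - 1 denominator, as in Eq. (1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def scale_by_std(values):
    """Divide every measurement of one variable by that variable's own standard deviation."""
    s = sample_std(values)
    return [x / s for x in values]

# Hypothetical measurements, for illustration only (not Table 1).
lengths = [0.9, 1.0, 1.1, 1.2, 0.8]           # values near 1.0
masses = [92.0, 101.0, 108.0, 121.0, 83.0]    # values near 100

scaled_lengths = scale_by_std(lengths)
scaled_masses = scale_by_std(masses)

# Cross-check against the standard library's sample standard deviation.
difference = abs(sample_std(lengths) - statistics.stdev(lengths))
```

After scaling, both variables have a standard deviation of 1 and therefore comparable magnitudes, even though the raw values differ by roughly two orders of magnitude.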