entities. Legitimate methods for testing the hypothesis that a priori groups are statistically
significantly different will be presented in later chapters.
Geometric Description of PCA
Figure 6.1A shows the simple case in which there are two observed traits, X₁ and X₂. These traits might be two distance measurements or the coordinates of a single landmark in a two-dimensional shape analysis. Each point in the scatter plot represents the paired values observed for a single specimen. We expect that the values of each trait are normally distributed, and we expect that one trait is more variable than the other. In this case, X₁ has a larger range of observed values and a higher variance than X₂. In addition, the values of X₁ and X₂ are not independent; higher values of one are associated with higher values of the other. This distribution of values can be summarized by an ellipse that is tilted in the X₁, X₂ coordinate plane (Figure 6.1B). PCA solves for the axes of this ellipse, and uses those axes to describe the positions of individuals within that ellipse.
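The geometric solution described here corresponds algebraically to an eigendecomposition of the covariance matrix: the eigenvectors are the directions of the ellipse axes, and the eigenvalues are the variances along them. A minimal sketch in NumPy, using an arbitrary illustrative covariance (the specific numbers are assumptions, not data from the figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two correlated, normally distributed traits, with X1 more
# variable than X2, analogous to Figure 6.1A. The covariance values
# below are illustrative assumptions.
cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200)

# PCA via eigendecomposition of the sample covariance matrix.
# Eigenvectors give the directions of the ellipse axes (the PCs);
# eigenvalues give the variance each axis describes.
sample_cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(sample_cov)   # ascending order
order = np.argsort(eigvals)[::-1]               # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: positions of the individuals along the ellipse axes.
scores = (data - data.mean(axis=0)) @ eigvecs

print(eigvals)        # variance along PC1 exceeds variance along PC2
print(eigvecs[:, 0])  # direction of PC1 in the X1, X2 plane
```

Note that the scores are uncorrelated with one another, and that the eigenvalues sum to the total variance of the original traits, which is why PC1 "describes the largest proportion of the total variance."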
The first step of PCA is to find the direction through the scatter that describes the largest proportion of the total variance. This direction, the long axis of the ellipse, is the first principal component (PC1). In an idealized case like that shown in Figure 6.1A, the line we seek is approximately the line through the two cases that have extreme values on both variables. Real data rarely have such convenient distributions, so we need a criterion that has more general utility. If we want to maximize the variance that the first axis describes, then we also want to minimize the variance that it does not describe; in other words, we want to minimize the sum of the squared distances of points away from the line (Figure 6.1C). (Note: the distances that are minimized by PCA are not the distances minimized in conventional least-squares regression analysis; see Chapter 8.)
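The distinction in the note can be checked numerically: PCA minimizes perpendicular distances to the line, whereas regression of X₂ on X₁ minimizes vertical distances, so the two lines generally differ. A small sketch (the simulated covariance is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])
data = rng.multivariate_normal([0.0, 0.0], cov, size=200)
centered = data - data.mean(axis=0)

def perp_ss(direction, pts):
    """Sum of squared perpendicular distances from the points to a line
    through the centroid with the given direction."""
    d = direction / np.linalg.norm(direction)
    proj = pts @ d
    return np.sum(np.sum(pts**2, axis=1) - proj**2)

# PC1 direction from the covariance eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# Least-squares regression of X2 on X1 gives a different line,
# because it minimizes vertical (not perpendicular) distances.
slope = np.polyfit(centered[:, 0], centered[:, 1], 1)[0]
reg_dir = np.array([1.0, slope])

# PC1 never does worse than the regression line on this criterion.
print(perp_ss(pc1, centered) <= perp_ss(reg_dir, centered))  # True
```

Because PC1 minimizes the perpendicular criterion over all lines through the centroid, its sum of squared perpendicular distances is never larger than that of the regression line.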
The next step is to describe the variation that is not described by PC1. When there are only two original variables, this is a trivial step; all of the variation that is not described by the first axis of the ellipse is completely described by the second axis. So, let us consider briefly the case in which there are three observed traits: X₁, X₂ and X₃. This situation is unlikely to arise in optimally superimposed landmark data, but it illustrates a generalization that can be applied to more realistic situations. As in the previous example, all traits are normally distributed and no trait is independent of the others. In addition, X₁ has the largest variance and X₃ has the smallest variance. A three-dimensional model of this distribution would look like a partially flattened blimp or watermelon (Figure 6.2A). Again, PC1 is the direction in which the sample has the largest variance (the long axis of the watermelon), but now a single line perpendicular to PC1 is not sufficient to describe the remaining variance. If we cut the watermelon in half perpendicular to PC1, the cross-section is another ellipse (Figure 6.2B). The individuals in the section (the seeds in the watermelon) lie in various directions around the central point, which is where PC1 passes through the section. Thus, the next step of the PCA is to describe the distribution of data points around PC1, not just for the central cross-section, but also for the entire length of the watermelon.
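This step, collapsing the watermelon onto its central cross-section, amounts to removing each point's component along PC1, leaving only the variation perpendicular to it. A minimal sketch with three simulated traits (the covariance values are illustrative assumptions, not the data behind Figure 6.2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Three correlated traits with decreasing variances (X1 > X2 > X3),
# a stand-in for the "watermelon" distribution of Figure 6.2A.
cov = np.array([[4.0, 1.2, 0.8],
                [1.2, 2.0, 0.5],
                [0.8, 0.5, 1.0]])
data = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=300)
centered = data - data.mean(axis=0)

# PC1 is the long axis of the watermelon.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# Project every point onto the central cross-section: subtract each
# point's component along PC1, keeping only the perpendicular part.
projected = centered - np.outer(centered @ pc1, pc1)

# The projected points have (numerically) zero extent along PC1.
print(np.allclose(projected @ pc1, 0.0))  # True
```

PC2 is then simply the long axis of the ellipse formed by these projected points, and the same subtraction can be repeated to find each subsequent component.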
To describe the variation that is not represented by PC1, we need to map, or project, all of the points onto the central cross-section (Figure 6.2C). Imagine standing the halved watermelon on the cut end and instantly vaporizing the pulp so that all of the seeds drop