Biology Reference
In-Depth Information
be used to approximate the original data set. For a single mean-centered observation, $x_j$,

$$x_j = \sum_{i=1}^{p} t_{ji}\,\mathrm{PC}_i = \sum_{i=1}^{A} t_{ji}\,\mathrm{PC}_i + E \tag{7}$$

where $t_{ji}$ are the score values, $p$ is the number of variables, $A$ is the number of principal components, and $E$ is the error when the number of principal components is less than the number of variables. Because the PCs are orthogonal, a direct expression for the score values can be given by the following equation:

$$t_{j,i} = x_j \cdot \mathrm{PC}_i \tag{8}$$

Equation (8) is derivable from Eq. (7) by taking the dot product of both sides with $\mathrm{PC}_i$ and exploiting the orthogonality of the principal components.

The previous example is somewhat trivial because only two variables were involved. Our next example, a multivariable data set, was originally used by Fisher.$^{6,7}$ Four taxonomic measurements (sepal length, sepal width, petal length, and petal width) are given for three varieties of flowers: Iris setosa, Iris versicolor, and Iris virginica. There are 25 samples of each variety. The data can be found in table form in the references.$^{6,7}$ This reference also includes pictures of the three varieties of irises used. Using commercial software, we can perform a PCA of the data set using the same approach that was used for the first data set, namely scaling by standard deviation and mean centering. A few of the critical results are shown in the following figures. The loading (principal component) plot shows some results that are clearly interpretable (Figure 3). The principal component plot shows how different variables relate to each other. In the plot, the reader can observe that petal length and petal width are very close to each other and are therefore well correlated. A plot of the score values for each of the 75 iris flower samples is shown in Figure 4. The three iris varieties are color coded (blue, I. setosa; green, I. versicolor; and red, I. virginica). The origin of the score plot corresponds to the average of the entire data set. Samples farther from the origin are more likely to be outliers. Inspection of Figure 4 shows that the I. setosa samples are well separated from the other two varieties, but I. versicolor and I. virginica nearly overlap.

The ellipse in Figure 4 is called the Hotelling $T^2$ ellipse and shows the 95% probability level for outliers. The Hotelling $T^2$ ellipse is based on scaled, squared score values. The $T^2$ value for observation $i$ is given in Eq. (9):

$$T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{S_{ta}^2}, \qquad S_{ta}^2 = \frac{1}{N}\sum_{i=1}^{N} t_{ia}^2 \tag{9}$$

where $A$ is the number of principal components, $N$ is the number of samples, and $t_{ia}$ is the $a$th principal component score value for the $i$th sample. $T^2$ is closely related to the widely used Mahalanobis distance. An important property of the $T^2$ statistic is that it is directly proportional to an $F$ value, which is a statistical parameter that is rigorously related to a probability value.* The numerical value of $F$ depends on the number of samples, the number of principal components, and the desired probability level, $\alpha$. Examination of Eq. (9) for two PCs shows that

$$F_\alpha = C\left(\frac{t_1^2}{S_1^2} + \frac{t_2^2}{S_2^2}\right) \tag{10}$$

where $C$ is a constant. Equation (10) is an equation for an ellipse in the $t_1$, $t_2$ space. By convention, the Hotelling $T^2$ ellipse is usually drawn at the 95% probability level.

PCA can be viewed as a method for approximating the original data set. The approximation is based on a linear combination of the principal components where the amplitude coefficients are the previously described scores. The approximation is exact when the number of principal components equals the number of variables in
* $T_i^2 \, \dfrac{N(N-A)}{A(N^2-1)}$ is approximately $F$-distributed.$^{4}$
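The relationships in Eqs. (7) to (9) and the footnote scaling can be checked numerically. The following is a minimal sketch, not the commercial software mentioned in the text. It assumes synthetic two-variable data (not the iris measurements) and uses a shortcut that holds only for two standardized variables, where the principal components are known in closed form as $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$. It also assumes unit-length PCs and the $1/N$ form of the score variance. All names (`standardize`, `reconstruct`, `PCS`, `F_like`) are hypothetical.

```python
import math
import random

# Synthetic, positively correlated two-variable data (illustrative only).
random.seed(0)
N = 50
raw = []
for _ in range(N):
    u = random.gauss(0.0, 1.0)
    raw.append((u + 0.3 * random.gauss(0.0, 1.0),
                2.0 * u + 0.5 * random.gauss(0.0, 1.0)))

def standardize(column):
    """Mean-center a column and scale it by its sample standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]

X = list(zip(*(standardize(col) for col in zip(*raw))))  # N rows of (x1, x2)

# Assumption: for two standardized variables the PCs have this closed form.
r = 1 / math.sqrt(2)
PCS = [(r, r), (r, -r)]  # orthonormal; the first dominates when correlation > 0

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Eq. (8): score t_{j,i} is the dot product of observation x_j with PC_i.
scores = [[dot(x, pc) for pc in PCS] for x in X]

def reconstruct(t, A):
    """Eq. (7): sum the first A components weighted by their scores."""
    return tuple(sum(t[i] * PCS[i][k] for i in range(A)) for k in range(2))

# Eq. (9): T^2 from squared scores scaled by the score variance S_ta^2.
A = 2
S2 = [sum(t[a] ** 2 for t in scores) / N for a in range(A)]
T2 = [sum(t[a] ** 2 / S2[a] for a in range(A)) for t in scores]

# Footnote scaling of T^2 to an approximately F-distributed quantity.
F_like = [t2 * N * (N - A) / (A * (N ** 2 - 1)) for t2 in T2]

exact = reconstruct(scores[0], 2)    # A = p: reproduces x_j
approx = reconstruct(scores[0], 1)   # A < p: leaves an error E
E = tuple(x - a for x, a in zip(X[0], approx))
print(X[0], exact, approx, E)
```

With both components retained, the reconstruction matches the standardized observation to rounding error. With one component, the residual `E` is the part of the observation lying along the discarded second PC. Because the score variance uses the $1/N$ form, the mean of `T2` over all samples equals `A`.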