8.4.1 Principal component analysis
The goal of principal component analysis (PCA) is to extract maximum variance
from the original data set with a few components. Each component, expressed as a
linear combination of original variables, is a unique mathematical solution. In
practical terms, PCA identifies variables that duplicate information held in other
variables, and hence might be considered redundant. It is based on the study of the
pairwise product-moment correlation coefficients computed in a data set. In general,
if there are p variables, the relation between the p measured (and/or derived)
variables on disease development, y_1, y_2, …, y_p, and the p principal components
(c_1, c_2, …, c_p) is
\[
\begin{aligned}
c_1 &= a_{11} y_1 + a_{12} y_2 + \cdots + a_{1p} y_p \\
c_2 &= a_{21} y_1 + a_{22} y_2 + \cdots + a_{2p} y_p \\
&\;\;\vdots \\
c_p &= a_{p1} y_1 + a_{p2} y_2 + \cdots + a_{pp} y_p
\end{aligned}
\tag{8.10}
\]
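As an illustration of equation (8.10), the coefficients a_ij can be obtained as eigenvectors of the correlation matrix. The following is a minimal sketch, not a reference implementation: the data are synthetic, the variable names (Y, Z, R, A, C) are hypothetical, and it assumes that the y's in (8.10) are the standardised variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 50 epidemics (rows), p = 4 correlated variables (columns).
n, p = 50, 4
latent = rng.normal(size=(n, 1))
Y = latent @ np.ones((1, p)) + 0.5 * rng.normal(size=(n, p))

# Standardise each variable to zero mean and unit variance (assumption).
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

# Pairwise product-moment correlations between the original variables.
R = np.corrcoef(Y, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending so that
# the first component carries the most variance.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
A = eigvecs[:, order].T  # row i holds a_i1, ..., a_ip of equation (8.10)

# Component scores: c_i = a_i1*y_1 + ... + a_ip*y_p for every observation.
C = Z @ A.T
```

In this sketch the variance of each column of C equals the corresponding eigenvalue, and the columns of C are mutually uncorrelated.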
Collectively, the information in the complete set of principal components is
equivalent to that held in the original data set. Where only a reduced number of
principal components is retained, there is some loss of information. The percentage
of information retained is expressed in the cumulative percentage of variation
explained. The first component explains the most variation in the original data; the
second explains the next most, and so on. Normally, if there are strong
correlations among the original variables, as is often the case for data describing various
aspects of temporal epidemic patterns, no more than three principal components are retained.
However, there are no formal criteria to verify the resulting PCA structure.
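The cumulative percentage of variation explained can be computed directly from the component variances (eigenvalues). The sketch below uses made-up eigenvalues for a hypothetical five-variable correlation matrix; the function names and the 90% target are illustrative assumptions, not a recommended retention rule.

```python
def cumulative_percent_explained(eigenvalues):
    """Cumulative % of total variance explained by the 1st, 2nd, ... component."""
    total = sum(eigenvalues)
    running = 0.0
    percents = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        percents.append(100.0 * running / total)
    return percents


def n_components_for(eigenvalues, target_percent):
    """Smallest number of components whose cumulative % reaches the target."""
    percents = cumulative_percent_explained(eigenvalues)
    for k, pct in enumerate(percents, start=1):
        if pct >= target_percent:
            return k
    return len(percents)


# Illustrative (invented) eigenvalues; they sum to p = 5 as for a correlation matrix.
example_eigenvalues = [3.1, 1.2, 0.4, 0.2, 0.1]
percents = cumulative_percent_explained(example_eigenvalues)
retained = n_components_for(example_eigenvalues, target_percent=90.0)
```

With strongly correlated variables the first few eigenvalues dominate, so the cumulative percentage rises quickly and few components are needed to reach a given target.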
It is essential to understand how the information that was held in the original
data is now contained in the reduced dimensions. A further rotation of the axes in the
reduced dimensions may be necessary to better highlight the relations among the
measured variables. Several ways are possible to achieve the 'best' rotation. One
method often used is the varimax rotation, which selects axes that minimise the
number of axes on which each measured variable is highly loaded. This tends
to result in each variable being highly correlated with only one axis, which helps
interpretation. Usually, orthogonal rotation techniques are used, leading to
statistically independent components; varimax rotation is one of the orthogonal techniques.
From the correlation matrix between the original variables and the retained, newly rotated
principal components, we can interpret each component. We may then use the retained
components for further analysis when comparing epidemic development. Clear
biological/epidemiological interpretations of each main component are crucial to PCA.
If the retained components cannot be explained biologically or epidemiologically, it may
not be worthwhile proceeding to further analysis with the component scores.
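To make the idea of a varimax rotation concrete, here is a minimal sketch using Kaiser's pairwise planar rotations on a matrix of loadings (correlations between variables and retained components). It is one possible formulation under stated assumptions: raw varimax without row normalisation, hypothetical loadings, and illustrative function names. Statistical packages may differ in these details.

```python
import numpy as np


def varimax_criterion(loadings):
    """Sum over components of the variance of the squared loadings."""
    squared = np.asarray(loadings, dtype=float) ** 2
    return float(np.sum(np.var(squared, axis=0)))


def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Raw varimax via pairwise planar rotations (no Kaiser normalisation).

    Returns the rotated loadings and the orthogonal matrix T such that
    rotated = loadings @ T.
    """
    L = np.array(loadings, dtype=float)  # work on a copy
    n_vars, n_comp = L.shape
    T = np.eye(n_comp)
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for i in range(n_comp - 1):
            for j in range(i + 1, n_comp):
                x, y = L[:, i], L[:, j]
                u = x**2 - y**2
                v = 2.0 * x * y
                num = 2.0 * np.sum(u * v) - 2.0 * u.sum() * v.sum() / n_vars
                den = np.sum(u**2 - v**2) - (u.sum() ** 2 - v.sum() ** 2) / n_vars
                phi = 0.25 * np.arctan2(num, den)  # angle maximising the criterion
                largest_angle = max(largest_angle, abs(phi))
                c, s = np.cos(phi), np.sin(phi)
                G = np.array([[c, -s], [s, c]])
                L[:, [i, j]] = L[:, [i, j]] @ G
                T[:, [i, j]] = T[:, [i, j]] @ G
        if largest_angle < tol:  # no pair needs further rotation
            break
    return L, T


# Hypothetical loadings: a simple structure deliberately mixed by a 30-degree turn.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.9]])
theta = np.deg2rad(30.0)
mix = np.array([[np.cos(theta), np.sin(theta)],
                [-np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix
rotated, T = varimax(unrotated)
```

Because T is orthogonal, the rotation leaves each variable's communality (row sum of squared loadings) unchanged; it only redistributes the loadings so that each variable loads mainly on one axis.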
In general, variables are standardised before PCA. Omitting
standardisation makes sense only if the variables are all measured on the same scale;
this is the case if PCA is applied to original disease incidence/severity data recorded
over time. Standardisation should be applied to a collection of variables measured