is the rate of error for the particular assignment method given the data, a rate contingent
on our given data; there is also an expected error rate which would be the rate we would
achieve given another data set similar to the one we have collected.
The simplest approach to estimating the actual error rate (given a particular data set) is to
use a cross-validation or jackknife procedure (Efron, 1983; Efron and Tibshirani, 1995; Van
Bocxlaer and Schultheiß, 2010). In these procedures, the data set is divided into two subsets,
a training set and a test set. The test set may be as small as one specimen, or as large
as 50% of the data set, with the training set consisting of all remaining data. The CVA is
then fit to the training set and used to assign the members of the test set to a group.
This process is repeated for a large number of possible variations of the test set, and the
rate of correct specimen assignment is computed over all the test sets employed. This
yields a cross-validation (or jackknife, when the test set has n = 1) rate of correct specimen
assignment, which is a better predictor of the overall effectiveness of the method than
the resubstitution rate. Most modern software will have some form of jackknife or cross-
validation method available.
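
The contrast between the resubstitution rate and the jackknife rate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple nearest-centroid rule stands in for the CVA-based assignment, and the function names (fit_centroids, assign, resubstitution_rate, jackknife_rate) are hypothetical, not taken from any morphometrics package.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Stand-in for fitting the CVA (assumption): the mean vector of each group."""
    grouped = defaultdict(list)
    for x, g in zip(samples, labels):
        grouped[g].append(x)
    return {g: [sum(col) / len(rows) for col in zip(*rows)]
            for g, rows in grouped.items()}


def assign(centroids, x):
    """Assign a specimen to the group whose centroid is nearest."""
    return min(centroids, key=lambda g: dist(centroids[g], x))


def resubstitution_rate(samples, labels):
    """Fit on all specimens, then classify those same specimens."""
    centroids = fit_centroids(samples, labels)
    hits = sum(assign(centroids, x) == g for x, g in zip(samples, labels))
    return hits / len(samples)


def jackknife_rate(samples, labels):
    """Leave-one-out: each specimen in turn is the test set (n = 1)."""
    hits = 0
    for i, (x, g) in enumerate(zip(samples, labels)):
        # The training set is every specimen except the one held out.
        train_x = samples[:i] + samples[i + 1:]
        train_g = labels[:i] + labels[i + 1:]
        centroids = fit_centroids(train_x, train_g)
        hits += assign(centroids, x) == g
    return hits / len(samples)


# Made-up data: group "b" has a single member, so it cannot be
# recovered when that member is the one held out.
samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
labels = ["a", "a", "b"]
print(resubstitution_rate(samples, labels))
print(jackknife_rate(samples, labels))
```

In this toy data set the resubstitution rate is perfect, while the jackknife rate is lower because the held-out specimen never contributes to the rule used to classify it.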
The difference between the resubstitution rate and the cross-validation rate of assignment
can be substantial. The plots of CVA scores produced by most software systems are based on
resubstitution, so the patterns in these plots must be viewed with substantial caution, as
they may overstate the effectiveness of the method. Nicely separated groups on a CVA plot
may not translate into effective cross-validation rates, or a statistically reliable
method. It is not unusual for a CVA to indicate that one or more of the CV
axes produced were statistically significant, but to produce assignment rates no better than
expected by chance. For this reason, it seems wise to require both statistically significant
CVA axes and a cross-validation rate of correct specimen assignment that is substantially
better than chance.
It turns out to be relatively straightforward to compute the random rate of correct speci-
men assignment that can be achieved via random sorting. A biased random allocation of
specimens to groups proportional to the number of individuals in the group will yield the
highest random rate. If we have a total of N specimens distributed among m groups, and
$n_i$ members in the $i$th group, then the maximum rate of random assignment of specimens
to groups is given by:
$$\text{Random Rate} = \sum_{i=1}^{m} \left( \frac{n_i}{N} \right)^2 \qquad (6.45)$$
This is achieved by randomly assigning each specimen to group $i$ with probability
$n_i / N$. In such a situation, Equation 6.45 is the expected rate of correct
specimen assignments. If the group sizes are all equal, this rate will simply be $1/m$, but if
the group sizes are unequal, the random rate will be higher than that.
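
As a quick check on Equation 6.45, the random rate can be computed directly from the group sizes. The function name random_rate and the example group sizes are illustrative assumptions, not part of the original text.

```python
def random_rate(group_sizes):
    """Expected rate of correct assignment when each specimen is assigned
    to group i with probability n_i / N (Equation 6.45)."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)


# Four equal groups: the rate reduces to 1/m.
print(random_rate([10, 10, 10, 10]))  # 0.25

# Two unequal groups: the rate is higher than 1/m = 0.5.
print(random_rate([30, 10]))  # 0.625
```

A cross-validation rate of correct assignment can then be compared against this baseline rather than against 1/m when the groups differ in size.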
In studies focusing on classification rate, rather than morphospace analysis, incorporat-
ing size into a CVA may help improve the classification rate. In such situations, it is possi-
ble simply to include the log of centroid size as an additional column in the data matrix.
As discussed in Chapter 4, the new space, sometimes called Procrustes Form Space, would
not be a shape space. Another possible approach would be to use the Procrustes Size
Preserving methods discussed in Chapter 14.
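
Adding the log of centroid size as an extra column might look like the following sketch. It assumes that each specimen's raw landmark coordinates and its row of shape variables are already available as plain Python lists; the helper names centroid_size and append_log_size are hypothetical.

```python
from math import log, sqrt


def centroid_size(landmarks):
    """Square root of the summed squared distances of the landmarks
    from their centroid."""
    k = len(landmarks)
    centroid = [sum(coord) / k for coord in zip(*landmarks)]
    return sqrt(sum((v - c) ** 2
                    for point in landmarks
                    for v, c in zip(point, centroid)))


def append_log_size(shape_rows, sizes):
    """Return a new data matrix with ln(centroid size) as the last column."""
    return [list(row) + [log(cs)] for row, cs in zip(shape_rows, sizes)]


# Made-up example: one specimen with four landmarks at the corners of a square.
raw = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
shape_rows = [[0.1, -0.2, 0.05]]  # placeholder shape variables (assumption)
augmented = append_log_size(shape_rows, [centroid_size(raw)])
print(augmented)
```

The augmented matrix would then be passed to the CVA in place of the shape variables alone.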