Measuring variation in natural populations: a primer (Genetics)

1. Sources of data

Most measures of variation (synonymous with diversity) in natural populations quantify the average amount of difference between two entities from a defined population. For example, biologists may be interested in differences in size or length among organisms in a population, differences among DNA or protein sequences from those organisms, and so on.

Variation in quantitative traits (synonymous with metric traits) such as mass, length, or color is quantified with standard statistics such as means, variances, and covariances. Many work with the logarithms of measurements rather than with the measurements themselves. The statistics have useful properties when the measured quantities are normally distributed, and the distributions of logarithms of metric traits are often more like normal distributions.
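As a minimal sketch of the log-transform approach described above, the following computes a mean, variance, and covariance on the logarithms of two metric traits. The measurements, variable names, and helper functions are hypothetical and chosen only for illustration; the sample (n - 1) denominator is likewise an assumption.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical measurements for five organisms (illustrative values only).
mass_g = [12.0, 15.5, 9.8, 20.1, 14.2]
length_mm = [51.0, 56.0, 47.5, 61.0, 54.0]

# Work with logarithms of the measurements rather than the raw values.
log_mass = [math.log(m) for m in mass_g]
log_length = [math.log(l) for l in length_mm]

print("mean log mass:", mean(log_mass))
print("variance of log mass:", covariance(log_mass, log_mass))
print("covariance of log mass and log length:", covariance(log_mass, log_length))
```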

Classical markers are polymorphisms where DNA variation is detected indirectly rather than by sequencing. The first to be studied systematically were immunological markers such as the ABO and Rh loci. At the ABO locus, for example, a chromosome could code for A-substance, B-substance, or no substance. Over time, technologies appeared that increased the number of markers: electrophoresis allowed protein variants to be distinguished, and restriction enzymes that recognize and cut a specific short nucleotide sequence led to a large number of markers for restriction fragment length polymorphisms (RFLPs) where the alleles were either cut or not cut by a specific enzyme. Methods of measuring diversity for all these are essentially the same: computation of expected heterozygosity under random mating. As large-scale sequencing of DNA has become feasible, these classical markers have fallen into disuse.


Classical markers, electrophoretic loci, and RFLPs are compromised for assessing population differences in diversity on a global scale. A landmark study of classical markers in humans revealed essentially no interesting patterns or striking population differences (Cavalli-Sforza et al., 1994), while subsequent work on DNA sequences and on repeat polymorphisms invariably shows that diversity within populations is greater in African populations. The problem is ascertainment bias: the markers were discovered mostly in European populations, so the sample of marker loci available is biased toward loci that are most variable in Europeans. This skews population diversity comparisons, and excess diversity in African populations is obscured. The human genome project has led to the discovery of millions of single nucleotide polymorphisms (SNPs), but the usefulness of these for population studies is again compromised by the ascertainment problem. The origins of the chromosomes used for discovery are not publicly known.

Some of this difficulty with ascertainment is overcome with the use of variable number of tandem repeat (VNTR) markers. VNTR loci are places on the chromosome where there are long stutters of a motif. For example, a locus where there are repeats of the nucleotide sequence ATAG would be a tetranucleotide polymorphism because the motif is four bases long. Short VNTRs, also called short tandem repeats (STRs) or microsatellites, with motifs two to four bases long have been discovered and published in large numbers. Because of the high diversity of these markers, the ascertainment problem is not great, and they are useful for population comparisons (Rogers and Jorde, 1996).

In recent decades, technology has become available to make direct study of DNA sequences from populations possible. The first sequence collections published were from mitochondrial DNA because it is available at high concentrations in cells and is haploid (see Article 4, Studies of human genetic history using the Y chromosome, Volume 1 and Article 5, Studies of human genetic history using mtDNA variation, Volume 1). More recently, nuclear autosomal sequences have become widely available. In general, sequence data do not suffer from the ascertainment bias that interferes with interpretations of classical markers, RFLPs, and SNPs.

2. Measuring and describing variation

It is important to be aware of the distinction between purely descriptive statistics and interpretations of those statistics in terms of a model of evolution. Most of the statistics I discuss below can be justified in terms of an evolutionary model but they are perfectly useful and routinely used when the conditions of the model are violated. When they are model-bound, it is usually appropriate to make small corrections to account for estimation bias: these issues are treated well in Nei (1987).

Almost all descriptive statistics about variation measure average differences between entities, whether organisms or DNA sequences or repeat lengths. A common justification is that differences accumulate since the separation time of the entities and the statistics are estimates of that time. The time may be an average coalescence time between DNA sequences or, in the case of organisms, an average coalescence time over all the loci in those organisms that contribute to variation in the trait.

Much theory is based on the special case of neutral genes in a panmictic population that has been of constant size for a long time, 4 to 8 G generations where G is the number of genes transmitted in the population each generation. For quantitative traits of whole organisms, the assumption is that a large number of neutral mutations of small effect accumulate along the lines of descent of loci that affect the trait. For VNTRs, the assumption is that mutations cause alleles to change in length by single steps, that loss or gain of a motif are equally likely, and that allele length is selectively neutral. For DNA sequences, the assumption is that mutations occur along the sequence according to a Poisson process, that the mutations are neutral, and that there is such a large number of nucleotide positions relative to the number of mutations that any single nucleotide position never experienced more than a single mutation event: this is the infinite sites assumption.

In the case of quantitative traits, the means, variances, and covariances among the traits, or among the logarithms of the traits, are the ordinary description of variation. If the heritabilities of the traits are known, then the covariance matrix among the traits can be transformed to be proportional to variation of genes underlying the traits. The heritability of a trait is the fraction of the variance that is due to variation in the additive effects of the underlying genes. When there are several traits, the heritability is a matrix, but in practice, detailed knowledge of heritabilities in natural populations is not available. Workers generally choose a plausible figure, perhaps 30 to 50%, and assume it applies to all the traits. An example of this approach is found in Relethford and Harpending (1994), who compared world human variation in craniometric traits with variation in classical markers and found the two data sets to agree closely.

Variation in a set of classical markers, RFLPs, or SNPs is measured simply by averaging the expected heterozygosity over all the loci. Expected heterozygosity at a locus is obtained by estimating the allele frequencies and then summing the product of each frequency and its complement over all the alleles. This measure has no theoretical interpretation because of ascertainment issues. It is possible to compare closely related populations but, as I mentioned above, global comparisons are apparently meaningless. As an example, in a study of Ashkenazi Jewish genetics (Cochran and Harpending, 2004), we collected marker frequencies at 144 autosomal loci from several European populations from the excellent ALFRED website (ALFRED, 2004) maintained by Kenneth and Judy Kidd at Yale University. Relative to a baseline of 1.0 for mixed Europeans, the average heterozygosity of Ashkenazi was 0.988, of Russians 0.988, and of Samaritans 0.846. The reduced heterozygosity of Samaritans is a signature of the population bottleneck in their history, while the figure for Ashkenazi Jews, comparable to that of Russians, shows that there was no detectable bottleneck at all in Ashkenazi history. This is of interest because the high frequency of certain inherited disorders in Ashkenazi is often attributed to genetic drift during a bottleneck, but the data show no sign of a bottleneck.
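The heterozygosity calculation can be sketched as follows. This is a minimal illustration, assuming each locus is given as a plain list of sampled allele labels; the function names and the toy genotypes are hypothetical, and no small-sample bias correction is applied.

```python
from collections import Counter

def expected_heterozygosity(alleles):
    """Sum over alleles of p * (1 - p), which equals 1 - sum(p**2)."""
    n = len(alleles)
    counts = Counter(alleles)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def mean_heterozygosity(loci):
    """Average the expected heterozygosity over all loci."""
    return sum(expected_heterozygosity(a) for a in loci) / len(loci)

# Hypothetical sample: two loci, four sampled alleles each.
loci = [
    ["A", "A", "B", "B"],  # frequencies 0.5 and 0.5
    ["A", "A", "A", "B"],  # frequencies 0.75 and 0.25
]

print(expected_heterozygosity(loci[0]))
print(expected_heterozygosity(loci[1]))
print(mean_heterozygosity(loci))
```

Dividing each population's average by that of a reference population would give relative figures of the kind quoted above.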

Variation in a sample of STR data can be measured as heterozygosity, the probability that two alleles are not the same, but a better measure is allele size variance. The single-step mutation model of STRs leads to the expectation that the mean squared difference in size between two alleles is proportional to their coalescence time. The mean squared allele size difference is just twice the variance of allele size, so overall variation is computed as the average variance of all loci. Empirically this is not satisfactory because there are usually outlier loci with very large variances that dominate the results. In comparisons of variation between populations, it is better to work with rank-ordered differences in variance rather than with variances themselves (Jorde et al., 1997).
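The relation between mean squared size difference and variance can be checked numerically. The sketch below assumes alleles are recorded as repeat counts, uses the population variance (denominator n), and averages squared differences over all ordered pairs drawn with replacement; under those assumptions the factor of two is exact. The repeat counts are hypothetical.

```python
def allele_size_variance(sizes):
    """Population variance (denominator n) of allele sizes in repeat units."""
    m = sum(sizes) / len(sizes)
    return sum((s - m) ** 2 for s in sizes) / len(sizes)

def mean_squared_difference(sizes):
    """Mean squared size difference over all ordered pairs, with replacement."""
    n = len(sizes)
    return sum((a - b) ** 2 for a in sizes for b in sizes) / (n * n)

# Hypothetical repeat counts at one STR locus.
sizes = [8, 9, 9, 10, 12]

variance = allele_size_variance(sizes)
msd = mean_squared_difference(sizes)
print(variance, msd)  # msd is twice the variance
```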

There are two standard ways to measure variation in a collection of DNA sequences. One is the mean pairwise sequence difference (MPD), the average number of nucleotide differences between all pairs of sequences. This is equivalent to heterozygosity computed at each nucleotide position and summed over all sites in the sequence. Since the number of mutations along a genealogy of a pair of sequences under a simple model is on average proportional to the length of the genealogy, MPD is another estimator of average separation time of a set of sequences. As an example, there is on the order of a single nucleotide difference between two human DNA sequences per 1000 nucleotide positions, so the MPD would be equal to 1 for one-kilobase sequences, 5 for five-kilobase sequences, and so on.
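A minimal sketch of the MPD calculation, assuming the sequences are already aligned and of equal length. The toy sequences and function names are hypothetical. The second function illustrates the equivalence with summed per-site heterozygosity; the n / (n - 1) factor is an assumption made here so that the two quantities agree exactly in a finite sample.

```python
from collections import Counter
from itertools import combinations

def mean_pairwise_difference(seqs):
    """Average number of differing positions over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)

def summed_site_heterozygosity(seqs):
    """Per-site heterozygosity summed over sites, scaled by n / (n - 1)."""
    n = len(seqs)
    total = 0.0
    for column in zip(*seqs):
        counts = Counter(column)
        total += 1.0 - sum((c / n) ** 2 for c in counts.values())
    return total * n / (n - 1)

# Hypothetical aligned sequences (illustrative only).
seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]

mpd = mean_pairwise_difference(seqs)
het = summed_site_heterozygosity(seqs)
print(mpd, het)
```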

Another statistic appropriate for sequences is the normalized number of segregating sites in the sample: under standard simplifying assumptions, the number of segregating sites divided by $\sum_{i=1}^{n-1} 1/i$, where n is the number of sequences, should be equal to the MPD. The difference between the two numbers is the basis of the Tajima statistic (Tajima, 1989) used to test for selection or demographic change in a population. This is the only statistic discussed in this article that is not a measure of the average separation time of chromosomes.
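The normalized number of segregating sites can be sketched in the same way. The example assumes aligned sequences of equal length and takes the normalizing constant to be the sum of 1/i for i from 1 to n - 1, the usual choice for this statistic; the toy sequences and function names are hypothetical.

```python
def segregating_sites(seqs):
    """Count alignment columns in which more than one nucleotide is present."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))

def normalizing_sum(n):
    """Sum of 1/i for i = 1 .. n - 1, where n is the number of sequences."""
    return sum(1.0 / i for i in range(1, n))

def normalized_segregating_sites(seqs):
    return segregating_sites(seqs) / normalizing_sum(len(seqs))

# Hypothetical aligned sequences (illustrative only).
seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]

print(segregating_sites(seqs))
print(normalized_segregating_sites(seqs))
```

Comparing this value with the MPD of the same sample gives the raw difference on which the Tajima statistic is built; the full statistic also divides by an estimate of its standard deviation, which is not shown here.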

All the above statistics describe variation within a population. It is often of interest to describe how different populations are from each other, variation among populations. The simple way to do this is to use any of the above statistics to compute variation separately within each of the subpopulations, take the average, and call it Vs. Then pool the subpopulations as if they were all members of the same population, compute variation in this pooled sample, and call it Vt. From these, compute the fraction Fst = (Vt – Vs)/Vt to describe the fraction of total variation that is between populations. This is invariably 10 to 15% among major human populations whether measured on metric traits, classical markers, VNTRs, or DNA sequences. Therefore, at neutral markers, the differences among major human populations correspond roughly to the differences among sets of half siblings from a random-mating population (12.5%) (see Article 2, Modeling human genetic history, Volume 1).
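The Fst recipe above can be sketched directly. This example uses expected heterozygosity at a single locus as the within-population measure, which is one of several possible choices, and assumes equal sample sizes so that simple pooling and unweighted averaging are reasonable. The allele samples and function names are hypothetical.

```python
from collections import Counter

def heterozygosity(alleles):
    """Expected heterozygosity, 1 - sum(p**2), from a list of allele labels."""
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in Counter(alleles).values())

def fst(subpopulations, variation=heterozygosity):
    """Fst = (Vt - Vs) / Vt for any within-population variation measure."""
    vs = sum(variation(sub) for sub in subpopulations) / len(subpopulations)
    pooled = [allele for sub in subpopulations for allele in sub]
    vt = variation(pooled)
    return (vt - vs) / vt

# Hypothetical subpopulations with allele frequencies 0.8 and 0.2 for "A".
pop1 = ["A"] * 8 + ["B"] * 2
pop2 = ["A"] * 2 + ["B"] * 8

print(fst([pop1, pop2]))
```

Passing a different function as `variation`, for example an allele size variance or a mean pairwise difference, applies the same recipe to other kinds of data.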

3. Interpreting pattern in diversity

So far I have discussed statistics that are single numbers that quantify the amount of diversity in a population. With many kinds of genetic data, there is more information to be gotten by looking at the patterning of diversity. With a collection of DNA sequences we can quantify diversity by the average pairwise sequence difference, but we can also look beyond the average at the distribution of all pairwise sequence differences. Similarly, the number of segregating sites is the basis of another number that quantifies diversity, as described above. Some segregating sites are present as singletons, sites where there is one nucleotide in a single sequence and another nucleotide in the other n – 1 sequences, while other segregating sites may have each nucleotide present in several of the sequences. Sequences in a sample are tips of a tree of descent, and characteristics of the sample of sequences may allow us to infer characteristics of that tree and from there something about population history or natural selection at the locus.

Figure 1 shows the history of a sample of seven sequences from a population. As we go backward in time, the number of ancestors of the sample decreases as pairs of sequences coalesce in common ancestors. For example, if there were two cousins in a sample of mtDNA sequences, daughters of sisters, then two generations ago their mtDNAs coalesced into the mtDNA carried by their maternal grandmother. We do not usually know what the tree that generated a set of data was like, but several simple statistics are useful to infer properties of the tree.

Figure 1 Gene tree showing the history of seven genes sampled from a population. The top of the tree, the coalescence of the sample, is usually 1 to 2 million years in the past for most human genes. Red circles show where mutations occurred

The tree in Figure 1 is typical of gene trees of neutral genes in populations that have not undergone drastic changes in size: the top of the tree for many human nuclear regions is on the order of 1.5 to 2 million years old. In the history of these sequences, six mutations occurred, indicated by circles on the tree. Mutations occur at random along the branches, and the probability of a mutation in any branch is proportional to the length of the branch. The oldest mutation in the tree is present in sequences E, F, and G, splitting the sample into 4 without and 3 with the mutation. Sequences B, C, and D share a mutation that also splits the sample into 3 with and 4 without. Sequences F, G, and A each carry a unique mutation, a singleton. Finally, sequences C and D share a mutation. If we tabulate segregating sites according to the number of copies of the mutation, we have 3 mutations that occur in 1 sequence, 1 that occurs in 2 sequences, and 2 that occur in 3 sequences. This tabulation is called a frequency spectrum, and we will see below that different demographic or selective histories can lead to different and distinctive frequency spectra.
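The tabulation for Figure 1 can be reproduced in a few lines. This is a minimal sketch in which each mutation is represented by the set of sequences that carry it, exactly as listed in the paragraph above; the variable and function names are hypothetical.

```python
from collections import Counter

# One set per mutation on the tree of Figure 1, naming the sequences that carry it.
mutation_carriers = [
    {"E", "F", "G"},  # oldest mutation
    {"B", "C", "D"},
    {"F"},            # singleton
    {"G"},            # singleton
    {"A"},            # singleton
    {"C", "D"},
]

def frequency_spectrum(carriers):
    """Map number of copies of a mutation to the number of such mutations."""
    return dict(sorted(Counter(len(c) for c in carriers).items()))

spectrum = frequency_spectrum(mutation_carriers)
print(spectrum)  # {1: 3, 2: 1, 3: 2}
```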

Figure 2 shows a coalescence tree that looks very different from that in Figure 1: it is shaped more like a comb or a star. There are at first few coalescence events, going backward in time, but then coalescences occur rapidly. This tree might represent human mitochondrial DNA, with the top of the tree at about 200 000 years ago. Since the rate of coalescence at any time is proportional to the reciprocal of the population size, Figure 2 suggests that the population from which we drew our samples was large until some time in the past, before which it was small so that coalescence events happened fast. In our sample of seven sequences from the tree in Figure 2, six mutations are singletons while only one, that shared by sequences C and D, occurs in two of the seven sequences. This excess of singletons is a signature of a star-like gene genealogy, suggesting that an episode of rapid population growth generated the data. Another interpretation is that a new selectively advantageous variant occurred about 200 000 years ago and spread rapidly and that today’s sequences are all derived from it. In general, we cannot distinguish between an episode of rapid population growth and a selective sweep of a new advantageous variant: both of these have been proposed to account for the clear star-like genealogical history of human mitochondrial DNA.

Figure 2 Gene tree showing the history of seven genes sampled from a population as in Figure 1. The population underwent population growth from a small number in the past, just after the coalescence. Human mitochondrial DNA has a history like this, with the top of the tree about 250 000 years ago

There is another property of the diversity portrayed in Figure 2 that is useful to compute. Imagine that there were many more mutations on the tree of Figure 2 that occurred at random along the branches. If we were to tabulate all possible pairwise sequence differences, they would all be similar because of the star-like structure of the tree. In Figure 1, sequences C and D are similar since not much time separates them, while in Figure 2 the difference between C and D is not very different from other pairwise differences. A histogram of pairwise differences is called a mismatch distribution, and a smooth unimodal mismatch distribution is a signature of a comb or star-like tree as in Figure 2.
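A mismatch distribution is simply a tally of pairwise differences. The sketch below assumes aligned sequences of equal length and contrasts two hypothetical toy samples: one in which every sequence carries its own private mutation, as on a star-like tree, and one made of two deeply separated groups. The sequences and function name are illustrative only.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(seqs):
    """Histogram of the number of differences over all pairs of sequences."""
    return Counter(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )

# Hypothetical star-like sample: each sequence has one private mutation.
star_like = ["TAAA", "ATAA", "AATA", "AAAT"]

# Hypothetical sample from two old lineages separated by four mutations.
two_lineages = ["AAAA", "AAAA", "TTTT", "TTTT"]

print(dict(mismatch_distribution(star_like)))     # all pairs differ by 2
print(dict(mismatch_distribution(two_lineages)))  # pairs differ by 0 or by 4
```

In the first sample every pair differs by the same amount, giving a single peak; in the second the histogram splits into within-lineage and between-lineage comparisons.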

Figure 3 shows a coalescence tree from a locus that has been under balancing selection. Selection that somehow favors a variant as it becomes less common can maintain separate gene lineages for much longer than they would persist if they were neutral. Figure 3 might represent, for example, the history of several human HLA alleles: in this system, uncommon variants are favored and alleles have deep gene trees, on the order of tens of millions of years.

Figure 3 Gene tree showing the history of seven genes sampled from a population as in Figure 1. At this locus, balancing selection maintains both lineages in the population so the coalescence is very old. For human HLA system genes, for example, the top of the tree may be several tens of millions of years old

Six random mutations are shown on this coalescence tree. Most of them are on the long deepest branches and divide the sample into 3 and 4: one mutation is present as a singleton, in chromosome D. The frequency spectrum has a “hump in the middle” in this case since there are more segregating sites with alleles of intermediate frequency than standard neutral theory predicts (see Article 7, Genetic signatures of natural selection, Volume 1).

There are standard statistics, provided by well-tested computer packages, that describe and quantify these insights, but their evaluation often involves complicated numerical calculations or even simulations. The simple graphical arguments given here are the bases for the more formal statistics.
