Population genomics: patterns of genetic variation within populations


1. Polymorphism

Polymorphism at the nucleotide level ranges over at least an order of magnitude within species, and average polymorphism ranges over two orders of magnitude between species. Homo sapiens is among the least polymorphic of all species, with a heterozygous single nucleotide polymorphism (SNP) generally occurring once every 500 to 1000 bp (International SNP Map Working Group, 2001). By contrast, marine invertebrates such as the sea squirt and echinoderms have an astonishing level of sequence diversity with a SNP every 5 to 10 bp (Dehal et al., 2002). Diversity is a function of organism-level factors such as population size, generation time, and breeding structure (Aquadro et al., 2001), but variation within and among chromosomes signifies that recombination and mutation rates are also critical (Begun and Aquadro, 1992; Charlesworth et al., 1995). In most species, centromeric and telomeric regions are less recombinogenic, hence have smaller effective population sizes, and tend to be less polymorphic (Nachman, 2002). Even within a locus, polymorphism can vary over an order of magnitude, according primarily to functional constraint: synonymous substitution rates tend to be uniform, whereas replacements can be excluded from highly conserved domains. Noncoding gene sequences are typically more polymorphic than exons and less polymorphic than intergenic DNA, but core regulatory sequences up to several hundred basepairs in length may often be the most conserved of all sequences (Wray et al., 2003).

Significant disparity between two measures of polymorphism, namely, the number of segregating sites and the average heterozygosity, provides evidence for departure from “neutrality” (Hudson et al., 1987; Kreitman, 2000). However, neutrality comes in many flavors, and demographic processes are just as likely to affect the difference between these two measures as is selection (Nielsen, 2001). Heterozygosity is a function of allele frequency as well as density, so unexpectedly high or low numbers of heterozygotes relative to the number of SNPs in a population can arise as a result of several processes that may be superimposed on random drift. Thus, rapid population expansion or strong purifying selection both reduce heterozygosity, whereas admixture or balancing selection will increase heterozygosity. Tests such as Tajima’s D (Tajima, 1989) have remained useful descriptors of diversity, but have been joined by a new series of tests that are more firmly rooted in coalescent theory (Wall and Hudson, 2001). Rather than strictly interpreting test scores relative to theoretical expectations, comparison of the distribution of test scores across tens or hundreds of loci among species emphasizes that diversity is affected by a complex interplay of factors and that it is the location of a gene at either extreme of the continuum that marks it as a candidate target of selection, rather than a p-value per se (Hey, 1999; Bustamante et al., 2002).
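The comparison between the number of segregating sites and average heterozygosity can be made concrete with Tajima's D. The following is a minimal sketch in Python, assuming a list of aligned, equal-length sequences with no missing data; the function name `tajimas_d` and the toy alignments are illustrative and are not taken from any of the cited studies.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return Tajima's D for aligned, equal-length sequences (hypothetical helper)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating sites (alignment columns with more than one state).
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        return None  # D is undefined when there is no variation

    # pi: average number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)

    # Normalizing constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy data: three singleton variants versus three variants at intermediate frequency.
rare = ["AAA", "AAA", "AAA", "TAA", "ATA", "AAT"]
common = ["AAA", "AAA", "AAA", "TTT", "TTT", "TTT"]
print(tajimas_d(rare), tajimas_d(common))
```

With the same number of segregating sites, the sample dominated by rare variants yields a negative D (low heterozygosity relative to the number of SNPs), and the sample with intermediate-frequency variants yields a positive D.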

A trend toward empirical evaluation of significance by permutation in light of genomic data is also seen in relation to population structure. Standard F-statistics introduced by Sewall Wright based on differences in genotype frequencies among populations (Weir and Hill, 2002) have been extended into an analysis of molecular variance (AMOVA) framework, one popular implementation of which is the Arlequin software (Schneider et al., 2000). Estimates of SNP, indel, haplotype, or microsatellite allele frequency differences are sensitive to sample size, so samples of at least 100 individuals per population are recommended. Using genomic data, the multiple comparison issue also arises: in a set of 500 sites, a single site with a testwise p-value of 0.0001 is not unexpected, but in a large sample this may correspond to an allele frequency difference of just 10%. Consequently, population structure is best estimated from multilocus data. For example, Pritchard et al. (2000) have introduced Bayesian statistics to assign individuals to likely sub-populations with numerous applications in evolutionary, conservation, quantitative, and human genetics. It is well known that over 90% of all human polymorphism is common to all populations, but the ability to genotype hundreds of loci has led to the recognition that given sufficient data there is a detectable signature of demographic history even in our species (Rosenberg et al., 2002). Similarly, long-held assumptions of panmixia in Drosophila melanogaster are being challenged by deeper sampling (Glinka et al., 2003), as are commonly held notions about the genetic uniformity of crops such as maize (Matsuoka et al., 2002), and in fact the power to discriminate population structure in most species will have a profound impact on quantitative biology.
An important implication of the ability to detect population structure is inference of departure from neutrality, by comparison of the observed F-statistics with those obtained from a collection of assumed neutral markers (Lewontin and Krakauer, 1973; Rockman et al., 2003).
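As a rough illustration of how an F-statistic summarizes allele frequency differences among populations, the sketch below computes Wright's F_ST for a single biallelic site as (H_T - H_S) / H_T. It assumes equally sized subpopulations and known allele frequencies, and it ignores the sampling corrections that packages such as Arlequin apply; the function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(freqs):
    """F_ST for one biallelic site from per-population allele frequencies.

    Assumes equal subpopulation sizes and no sampling error (a simplification).
    """
    if len(freqs) < 2:
        raise ValueError("need at least two populations")
    if any(not 0.0 <= p <= 1.0 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")

    p_bar = sum(freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return None  # site is monomorphic overall; F_ST is undefined
    # H_S: mean expected heterozygosity within subpopulations.
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


# Two populations whose allele frequencies differ by 10% around 0.5.
print(fst_biallelic([0.45, 0.55]))
```

Under these assumptions, a 10% frequency difference centered on 0.5 corresponds to an F_ST of about 0.01, which shows how a statistically significant single-site difference can still reflect very modest structure.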

The advent of new sequencing and genotyping technologies will only accelerate the data-driven nature of evolutionary genetic research (see Article 7, Single molecule array-based sequencing, Volume 3). ABI 3730 automated DNA sequencing machines routinely generate traces with over 1 kb of high-quality sequence and have a throughput capacity exceeding 1 Mb per day. Single-molecule sequencing methods are expected to make the sequencing of complete eukaryotic genomes for $1000 each a reality, possibly in the next decade (Meldrum, 2000), while massively parallel resequencing by hybridization to wafers of tiled oligonucleotides has already been used to characterize polymorphism between primate species (Frazer et al., 2003). Such studies have identified hundreds of loci that are candidates for adaptive evolution in the recent human lineage, some of which are likely to contribute to the etiology of common disease (Tishkoff and Verrilli, 2003; Clark et al., 2003). Molecular evolutionary studies of single genes in samples of 30 individuals have been typical but will soon be dwarfed by genome-scale sampling, and increasingly, attention will be placed on efficient sampling design and the formulation of hypotheses that utilize patterns of variation across the genome to interpret unusual patterns of variation at focal loci. Describing the variance of standard population-genetic parameters at a genome-wide scale is unprecedented territory, and developing approaches to quantify this variation across these expansive contiguous regions is the challenge for the near future. This type of data will also allow reexamination of some of the most basic assumptions underlying many population-genetic approaches, such as the infinite sites and island migration models.

2. Recombination and linkage disequilibrium

Recombination and mutation are the two biochemical processes that influence the distribution of molecular variation. Recombination can be directly measured by monitoring the coinheritance of markers transmitted from parent to offspring, but with the exception of technically demanding single sperm typing (Jeffreys et al., 2000), the resolution of this method is of the order of just centimorgans or hundreds of kilobases. Since an important consequence of recombination is its effect on linkage disequilibrium over scales from tens of bases to tens of kilobases, indirect methods for measuring recombination have been introduced based on population-genetic measurement of the cosegregation of markers (Hudson and Kaplan, 1985; Stumpf and McVean, 2003). Linkage disequilibrium (LD) is the nonrandom assortment of genetic markers: given two alleles each at a frequency of 20%, just 4% of individual chromosomes should have both alleles if assortment is random, but physically adjacent markers will often cosegregate more often. In this case, the maximum possible LD would have 20% of the chromosomes with both less common alleles, and 80% with both common alleles. Two commonly used statistics measure this departure from randomness, D’ and r², only the latter of which explicitly takes allele frequencies into account (Hill and Robertson, 1966; Weir, 1996). A further technical challenge in the measurement of LD is establishing the linkage phase of double heterozygotes, which can be addressed directly by studying trios of parents and their offspring (which is however impractical for many species) or computationally with EM likelihood algorithms (Fallin and Schork, 2000; Stephens et al., 2001).
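The 20% example above can be worked through numerically. The sketch below computes D, the normalized |D’|, and r² from a haplotype frequency and the two allele frequencies, assuming phased data so that the frequency of chromosomes carrying both alleles is known; the function name `ld_stats` is illustrative.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic markers.

    p_ab: frequency of chromosomes carrying both alleles A and B (phased data assumed)
    p_a, p_b: frequencies of allele A and allele B
    """
    # D: excess of the AB haplotype over its expectation under random assortment.
    d = p_ab - p_a * p_b

    # D_max: largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the allelic states at the two markers.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Random assortment: 4% of chromosomes carry both 20% alleles.
print(ld_stats(0.04, 0.2, 0.2))
# Maximum LD: all 20% of chromosomes with one rare allele also carry the other.
print(ld_stats(0.20, 0.2, 0.2))
```

In the random-assortment case all three statistics are zero up to floating-point rounding, whereas in the complete-association case D is 0.16 and both |D’| and r² equal 1.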

Quantitative geneticists have long been interested in LD because detection of association between markers and phenotypes is dependent on LD between anonymous markers and the causative disease or quantitative trait nucleotide(s) (Zondervan and Cardon, 2004). This idea has given rise to the human HapMap project, which is an effort to describe the complete pattern of haplotypes in the human genome (International HapMap Consortium, 2003). Haplotypes are sets of multi-locus alleles, and because of LD fewer of them are observed than chance would predict: there are 32 possible ways that the alleles at five biallelic sites can combine, but typically just a handful of these will be at any appreciable frequency in a population. Standard population-genetic theory predicts that LD should decay monotonically with distance, but at least in the human genome it now appears that there are often fairly discrete boundaries that define haplotype blocks that range in length from 10 to 100 kb or more (Gabriel et al., 2002; see also Article 12, Haplotype mapping, Volume 3 and Article 74, Finding and using haplotype blocks in candidate gene association studies, Volume 4). Consequently, while there are in excess of 5 million SNPs in the human genome, there may be as few as 50 000 common haplotype blocks, and it is argued that a similar number of markers will be sufficient to perform genome scans for association with disease (Risch and Merikangas, 1996). According to the common disease-common variant hypothesis, the polymorphisms that contribute to many complex human diseases are likely to have arisen early in human history, but sufficiently recently that they remain embedded in observable common haplotypes. Similarly, selected phenotypes or polymorphic traits of interest to evolutionary biologists and ecologists may be due to nucleotide variants that can be identified by LD mapping.

There is considerable debate over the reasons for the detection of haplotype blocks, with explanations ranging from sampling variance to unequal recombination rates and/or gene conversion hotspots within loci (Wall and Pritchard, 2003; Stumpf and Goldstein, 2003), and studies of the population structure of haplotypes are in their infancy. With respect to evolutionary and agricultural genetics, measurement of haplotype structure is increasingly important. Domesticated crops and livestock are likely to have strong haplotype structure as a result of their breeding history (Flint-Garcia et al., 2003), whereas outbred and highly polymorphic species such as Drosophila melanogaster are almost devoid of haplotypes (see Article 10, Linking DNA to production: the mapping of quantitative trait loci in livestock, Volume 3). More recent is the advent of population genetics in nonmodel systems that are important with respect to epidemiology, particularly in humans, such as HIV and Plasmodium (malaria). The frequency of outcrossing or mixing among these species may contribute to these organisms’ ability to evade host immunity (Awadalla, 2003). The ability to dissect quantitative traits to the nucleotide level in any species is ultimately dependent on the thorough characterization of haplotype diversity.

3. Mutation, gene content, and the transcriptome

Population genomics also encompasses several novel aspects of variation that were beyond the technical reach of classical population genetics. For example, direct measurement of mutation rates is now possible, and will complement a large body of literature on the genetic consequences of mutation accumulation (Keightley and Lynch, 2003). For many species, it has been estimated that new genetic variance for fitness or morphological traits is generated at a rate within an order of magnitude of 0.1% of the environmental variance per generation (Clayton and Robertson, 1955; Houle et al., 1996). Similarly, genetic evidence suggests that a typical per locus spontaneous mutation rate is approximately 10⁻⁶ per generation, from which nucleotides are inferred to substitute in each meiosis at a rate close to 10⁻⁹. Microsatellites evolve at a much accelerated rate, but with a high variance, as directly measured by comparison of parent and offspring genotypes in several studies (Ellegren, 2000). Insertion-deletion (indel) polymorphism is prevalent, particularly in studies of regulatory regions of genes, but has been relatively neglected by theoreticians because of the absence of good molecular data on the tempo and mode of indel generation (Li, 1997). Genomic sequence data from invertebrates such as the nematode Caenorhabditis elegans that can be propagated essentially clonally (with a population size of 1) will provide measurements of mutation rates independent of the filter of natural selection (Vassilieva et al., 2000), offering a crucial comparison with standing variation in natural populations.

Gene order and content are unlikely to be highly polymorphic within populations of multicellular eukaryotes, but have emerged as a challenging feature of microbial genetics. A mixture of processes including conjugation, horizontal transfer from other species, plasmid shuffling, and spontaneous deletion or duplication results in differences among congeneric bacteria affecting 10% or more of the genome (Ochman and Jones, 2000; Daubin et al., 2003; see also Article 66, Methods for detecting horizontal transfer of genes, Volume 4). Whole-genome sequence comparisons have revealed the existence of pathogenicity and virulence islands of genes that distinguish isolates of Bacillus, Escherichia, and several other bacterial species (Whittam and Bumbaugh, 2002; Hacker and Kaper, 2000), but more generally it has been suggested that each species is defined by a core set of definitive genes that are accompanied by hundreds of variable genes whose presence defines the metabolic capacity of each isolate (Lan and Reeves, 2001). Our conception of microbial diversity is under equally profound challenge through the advent of whole-flora shotgun sequencing, an approach designed to characterize new species that cannot be cultured in vitro (De Long, 2002; Venter et al., 2004). As many as 90% of the microbial species in water, soil, and body cavities remain to be described, and genomic arrays will also be developed for use in monitoring diversity in microbial ecosystems.

Finally, the structure of transcriptional variation is emerging as a new field of enquiry (see Article 90, Microarrays: an overview, Volume 4). Almost no attention has been given to the prevalence of variation for alternative splicing, despite the fact that mutational studies indicate that a considerable fraction of sites affect splicing efficiency in a quantitative manner (Cartegni et al., 2002). Transcript abundance itself is also variable among individuals, as a result of both environmental and genetic factors (Yan and Zhou, 2004). Estimates from half a dozen species indicate that at least 10% of the transcriptome differs in abundance between any two individuals, but almost nothing is known of the tissue and temporal specificity of differential transcription (Cheung and Spielman, 2002; Gibson, 2002). Descriptors of the frequencies of qualitatively distinct levels of transcript abundance, of the cosegregation of these “transcriptional alleles” within and among populations, and of their heritability will be a fundamental component of future efforts to describe the genetic architecture of complex traits.
