Geographic structure of human genetic variation: medical and evolutionary implications

1. Introduction

This article is divided into three sections. In the first section we shall describe what we consider the main features of human geographic structure, that is, the fact that, on average, two individuals living in the same geographic area are genetically more similar than two individuals living in different areas. Some common patterns observed at the global or local scale will be considered, and their possible origin will be discussed. In the second section, we shall summarize the consequences of the genetic structure currently observed in our species. Particular emphasis will be given to the known biomedical aspects, but the potential evolutionary implications will also be analyzed. Finally, in the third section, we shall discuss the possible future evolution of human population structure, and how to predict it.

2. Human populations are genetically structured: how much, how, and why

Despite the recent appearance of our species, and despite our ability and tendency to move, not only to colonize new empty environments but also to exchange migrants among populations (see Article 4, Studies of human genetic history using the Y chromosome, Volume 1; Article 5, Studies of human genetic history using mtDNA variation, Volume 1; Article 2, Modeling human genetic history, Volume 1; Article 71, SNPs and human history, Volume 4), human groups from different geographic areas are often genetically distinct. The levels and patterns of this differentiation vary widely depending on the population, the genetic marker, and the sex we consider. In humans, more than in other species, genetic structure is also unstable over time. Since the first colonization of the world by Homo sapiens, there has probably never been a period long enough to establish a stable equilibrium of gene flow between local groups. Cultural and social changes, almost always associated with demographic effects, have modified and continue to modify our population-genetic structure. But what is the situation today?

Roughly speaking, about 85% of the human genetic variation analyzed so far at the worldwide scale can be found, on average, within a single population (Lewontin, 1972; Jobling et al., 2004; Barbujani, 2005; see also Article 1, Population genomics: patterns of genetic variation within populations, Volume 1). This figure is much higher than that observed in several terrestrial mammals even at smaller geographic scales, where values smaller than 50% are not uncommon (e.g., Templeton, 1999). In other words, considering just this figure, population structure in humans appears limited: on average, only about 15% of the genetic variation in our species occurs between populations, much less than observed, for example, in roe deer populations in Europe (Vernesi et al., 2002) or in common chimpanzee groups from different African forests (Kaessmann et al., 1999). This overall high level of population homogeneity can be explained by the history and demography of humans: recent evolution and high mobility. We all descend from a group of Eastern Africans, probably on the order of 10^4 individuals (Takahata et al., 1995; Jobling et al., 2004), who dispersed and started to evolve in partially independent groups only in the last 60 000 years or so, and from whom the current 6 billion humans inherited their genes. However, the genetic structure of human populations would probably be stronger in a less mobile species. Genetic drift of allele frequencies due to founder effects during colonization or, in general, to small population size was probably the rule for much of our history. Only recurrent gene flow (mainly between neighboring groups) and massive migration processes (e.g., during the Neolithic transition) probably prevented a higher average divergence (at least in terms of allele frequencies) between human groups.

Fst, an index of genetic distance, expresses differences between populations as the fraction of the total genetic variance that is due to variation between populations. An average Fst of 15% can be considered low compared to other species, but it is still substantial, since it indicates that human populations are, in fact, genetically differentiated. Only by looking in more detail at specific populations, cultures, geographic areas, genes or markers, and time frames can we try to understand the fine structure of human genetic variation and its implications.
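As a rough illustration of this index, the following Python sketch computes Fst for a single biallelic locus as the fraction of the total expected heterozygosity that lies between populations, (H_T - H_S) / H_T. The function name, the equal weighting of populations, and the example allele frequencies are assumptions made for illustration; they are not taken from the studies cited here, and published estimators also correct for sample size.

```python
def fst_biallelic(freqs):
    """Fst = (H_T - H_S) / H_T for one biallelic locus.

    freqs: allele frequencies of the same allele in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    if not freqs:
        raise ValueError("need at least one population")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity if all populations formed one random-mating unit.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within populations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    if h_total == 0:
        return 0.0  # locus is monomorphic everywhere: no variation to apportion
    return (h_total - h_within) / h_total


# Hypothetical frequencies in three populations.
example_fst = fst_biallelic([0.3, 0.5, 0.7])
```

With these hypothetical frequencies the result is about 0.11, that is, roughly 11% of the variation at this locus lies between populations and the rest within them.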

There are many exceptions, but, as a rule, genetic variation tends to be geographically structured following a clinal or an isolation-by-distance (IBD) pattern. This means that the major patterns observed at large geographic scales can be described as a simple increase of Fst with geographic distance (Cavalli-Sforza et al., 1994; Relethford, 2004; Serre and Paabo, 2004). The major mass migration processes responsible for the clines are believed to be the Paleolithic out-of-Africa dispersal of anatomically modern humans, and the dispersal of Neolithic farmers from the areas of origin of agriculture in all continents but Australia (Bellwood, 2004). On the other hand, local gene flow between neighboring populations can produce the IBD pattern, that is, an increase of Fst (a decrease of genetic similarity) with increasing geographic distance in any direction, up to a certain distance beyond which genetic exchange is minimal. Isolation by distance is probably a continuing process in virtually all human populations. Both IBD and directional dispersal produce continuous genetic change (CGC), which implies that (1) Fst values can be much higher than 15% when geographically distant populations are compared, and (2) the definition of a higher hierarchical classification, that is, the identification of distinct groups of genetically similar populations, is hard or impossible (Cooper et al., 2003). This last point, of course, does not mean that natural (sea, mountains, deserts, etc.) and cultural (language, tradition, behavior, etc.) barriers to gene flow do not create some level of genetic discontinuity between specific groups of populations. However, when different sets of markers are used to estimate the number of such groups or clusters, and to correctly reallocate genotypes to the cluster of origin, only broad continental groups (Africa, Eurasia, Oceania, and America) appear consistent across studies (Romualdi et al., 2002; Rosenberg et al., 2002; Bamshad et al., 2003).
A recent study (Serre and Paabo, 2004) even suggests that sampling bias (populations are usually sampled on the basis of an a priori idea about the groups) could explain this result better than the real presence of continental genetic barriers. Additional samples more evenly distributed in geographic space are needed, but present data suggest that CGC is the rule, and genetic discontinuities in space are the exception: major genetic groupings, including races (Barbujani, 2005), are not an accurate basis for classifying humans. CGC does not mean that Africans and Asians are not genetically different, but it does mean that this difference can be low or high, depending on the geographic distance between the populations considered and on the set of loci analyzed. Therefore, the genetic and evolutionary meaning of these groups is limited.

An additional feature of human geographic structure is the occurrence of single populations that strongly deviate from the general patterns reflecting the 85/15 apportionment of diversity and the effects of CGC. Geographic and cultural barriers, which strongly reduce reproductive contact with neighboring groups, usually coupled with small population sizes, can result in highly differentiated populations, or genetic isolates. A genetic threshold to identify such specific population histories has not been defined. However, populations that clearly stand out from the 85/15 + CGC patterns when several loci are considered (e.g., in a multivariate analysis), and therefore have pairwise Fst values higher than 15% in comparison with most other populations, can be considered genetic isolates. Typical barriers involved in the known cases (Cavalli-Sforza et al., 1994; Jobling et al., 2004) are the sea (e.g., Sardinians, Papua New Guineans), language (e.g., the Basques, who are possibly a Paleolithic non-Indo-European relic), and lifestyle (e.g., hunter-gatherers such as African Pygmies), sometimes combined with high geographic isolation (e.g., the Lapps). Other groups, initially identified as genetic isolates on the basis of single-marker analyses or a high incidence of genetically inherited diseases (e.g., Ladin speakers, Ashkenazi Jews, or Finns), revealed only an average degree of differentiation when investigated at the genomic level. Only multilocus analyses can estimate the real level of genetic isolation, since stochastic errors or gene-specific selection processes may confound the pattern in single-locus studies. For example, Ashkenazi Jews were recently classified together with Norwegians and Armenians when 39 markers were considered (Wilson et al., 2001), and Finns, who are characterized by a specific set of disease alleles, do not appear to have a specific genomic make-up (Cavalli-Sforza et al., 1994; Jobling et al., 2004).
Similarly, when genetic isolates are identified, it is important to distinguish “pure drift” isolates, that is, recently diverged small populations with reduced variability and almost no specific alleles or mutations, from populations with a long independent accumulation of specific molecular variation. The former can be very useful in mapping studies, but the latter certainly contain more information about the evolution of our species and our genomes.

The reduced fraction of genetic variation between populations, only about 15% on average, is thus structured in a CGC pattern with outliers. This general view is mainly based on genetic polymorphisms such as mtDNA sequences, microsatellites, SNPs (see Article 71, SNPs and human history, Volume 4), Alu insertions, and classical electrophoretic markers, that is, on genome fragments where selection is commonly believed to be either weak or absent. Most of the genome consists of regions of this kind (even though their possible regulatory functions are not completely understood). However, the geographic distribution of positively selected genes can be very different from the 85/15 + CGC + outliers predictions. Local adaptation and widespread (and similar) selective pressures produce, in fact, stronger and weaker geographic structure, respectively, and the genetic effects of selection can be very rapid (Hartl and Clark, 1997; Hendry and Kinnison, 2001). There is evidence, for example, that the mutation associated with lactase persistence in adulthood (which allows lactose digestion) was selected in the last 10 000 years or so in pastoralist societies where milk drinking was an important part of the diet (Beja-Pereira et al., 2003; Coelho et al., 2005). This process increased the frequency of the mutations associated with lactase persistence in these groups, and we now observe a higher population divergence (at this and probably at physically linked loci) when their descendants are compared with non-milk drinkers. On the contrary, we expect that directional selection for the same allele in different populations, or balancing selection maintaining the same set of alleles in different groups, would produce lower-than-average population divergence. Genes showing low geographic structure, possibly because of balancing selection, include the HLA, CCR5, and PTC genes (Cavalli-Sforza et al., 1994; Bamshad et al., 2002; Wooding et al., 2004).
It is interesting to note that recent methods (Luikart et al., 2003), based on earlier ideas (Lewontin and Krakauer, 1973), reverse this approach: Fst is estimated for many DNA regions, and values falling in the upper and lower tails of the distribution are considered suggestive of selective processes.
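The logic of such a scan can be sketched in a few lines of Python. This is a deliberately naive version that only ranks loci and reports the empirical tails. It is not the procedure of the cited studies, which compare observed values with a neutral expectation. The function name, the tail fraction, and the locus names and Fst values are all hypothetical.

```python
def flag_fst_outliers(fst_by_locus, tail=0.05):
    """Return (low, high): locus names in the lower and upper empirical tails.

    fst_by_locus: dict mapping locus name -> Fst estimate.
    tail: fraction of loci reported in each tail (at least one locus each).
    """
    if not 0 < tail < 0.5:
        raise ValueError("tail must be between 0 and 0.5")
    ranked = sorted(fst_by_locus, key=fst_by_locus.get)  # ascending Fst
    k = max(1, int(len(ranked) * tail))
    return ranked[:k], ranked[-k:]


# Hypothetical scan: ten unremarkable loci plus two extreme ones.
scan = {f"locus{i}": 0.10 + 0.01 * i for i in range(10)}
scan["candidate_balancing"] = 0.01   # unusually low differentiation
scan["candidate_local"] = 0.60       # unusually high differentiation
low_tail, high_tail = flag_fst_outliers(scan, tail=0.1)
```

Loci in the lower tail would be candidates for balancing or uniform directional selection, and loci in the upper tail for local adaptation; in both cases drift alone can also produce extreme values, so the tails are suggestive rather than conclusive.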

Finally, the fact that we have separate sexes also has an impact on human geographic structure. In a population with the same number, N/2, of men and women, there are 2N copies of each autosomal DNA fragment, 3N/2 copies of each X-linked fragment, and N/2 copies each of Y-linked and mtDNA fragments that can be transmitted to the next generation. In other words, random drift, whose strength is negatively correlated with population size, works at three different speeds in the same populations at these three classes of markers. Consequently, three different levels of population structure should be expected. In addition, the migration rate and the effective population size may differ between men and women. Most human populations are patrilocal, meaning that after marriage men tend to stay in their birthplace more than women, and our species was probably polygynous until recent times (Dupanloup et al., 2003), meaning that the number of Y chromosomes transmitted every generation was lower (and drift higher) than expected. Experimental data suggest that all these factors have been important, because the average level of geographic structure (say, Fst) in humans increases from autosomal to mtDNA markers (probably due to the different population size of the markers), and from mtDNA markers to Y-linked markers (presumably due to different male and female behavior at reproduction) (Jobling et al., 2004). The relevance of sex-specific migration patterns in shaping the genetic structure of human populations has recently been confirmed by comparing patrilocal and matrilocal tribes of Thailand, where reverse patterns in mtDNA and Y-chromosome markers have been found (Oota et al., 2001; Hamilton et al., 2005).
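The copy-number argument can be checked with a short Python sketch that counts gene copies per marker class, given one X chromosome in men and two in women. The function name and the example population size are assumptions for illustration; the sketch counts census copies only and ignores the differences in reproductive success and migration discussed above.

```python
def transmissible_copies(n_men, n_women):
    """Count gene copies per marker class in one generation.

    Autosomes: two copies in every individual.
    X chromosome: one copy in men, two in women.
    Y chromosome: one copy per man.
    mtDNA: counted once per woman, since only mothers transmit it.
    """
    if n_men < 0 or n_women < 0:
        raise ValueError("counts must be non-negative")
    return {
        "autosomal": 2 * (n_men + n_women),
        "x_linked": n_men + 2 * n_women,
        "y_linked": n_men,
        "mtdna": n_women,
    }


# Hypothetical population of N = 100 with equal numbers of men and women.
counts = transmissible_copies(50, 50)
```

For N = 100 this gives 200 autosomal, 150 X-linked, 50 Y-linked, and 50 mtDNA copies, that is, a 4 : 3 : 1 : 1 ratio, so drift is expected to be strongest at the uniparentally inherited markers.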

In conclusion, the present structure of human populations is characterized not only by some general rules and established patterns, which tell us something about the major processes that shaped genetic variation, but also by several exceptions. The goal for the future is to increase our knowledge and understanding of the fine structure by studying isolated groups and selected genes.

3. Population structure in humans: what does this imply?

We now discuss some biomedical and evolutionary implications of the genetic structure described in the previous section.

3.1. Biomedical implications

The geographic patterns at neutral markers, mainly affected by demographic and historical processes, are in part known and understood (see previous section), and can easily be investigated in more depth by typing more populations and more markers. But what do we know about the genetic structure of disease alleles? Theoretical predictions and available data can be helpful.

Dominant deleterious mutations are rare, because their expression in all individuals with at least one affected chromosome rapidly drives them to extinction. Simple monogenic diseases with this type of inheritance are therefore rare, with a similar incidence in different populations simply regulated by mutation-selection equilibrium. Similarly infrequent are the simple genetic diseases whose heterozygous carriers enjoy some selective advantage over the healthy homozygotes as well (e.g., sickle cell anemia). In the few clear situations of this type, the population structure at these genes can be decoupled from the structure observed at neutral markers. Environmental differences in the selective pressure (e.g., the geographic distribution of malaria for sickle cell anemia) are, in fact, responsible for the observed population structure. On the other hand, when the fitness effects of mutations are expressed only in homozygotes, or after the reproductive age, or are, in general, very limited, as in the complex multigenic diseases (see Article 57, Genetics of complex diseases: lessons from type 2 diabetes, Volume 2; Article 58, Concept of complex trait genetics, Volume 2), the major factors acting on their frequencies are the same that affect neutral markers: demographic and historical processes. For example, the high frequency of several genetic diseases in the Jews is a consequence of their isolation and small effective population size. In other words, the major patterns of population structure described in the previous section are expected to be very similar in most simple genetic disorders and in the numerous susceptibility alleles involved in complex genetic diseases.

There seem to be six main reasons why population structure should not be ignored in biomedical studies:

1. Priorities in genetic testing for different diseases (see Article 69, Current approaches to prenatal screening and diagnosis, Volume 2; Article 83, Carrier screening: a tutorial, Volume 2) or in treatments should take the differences among populations into account. Given certain symptoms, the most likely disease also depends on the population affiliation of the affected individual. If data on disease incidence are not available, populations that are genetically similar at neutral markers can be used as a proxy. If a population is a genetic isolate, specific disease risk alleles are expected.

2. Isolated and recently founded populations (e.g., Finns, possibly Sardinians or Icelanders) should be preferred in mapping studies by linkage disequilibrium (LD) in unrelated individuals and for studies of multigenic diseases in general. These populations are, in fact, comparable to large families, where genetic heterogeneity is rare (and so affected individuals carry the same mutation) and recombination has not had time to disrupt the statistical association between the hunted gene and the flanking markers. The possibility of finding such situations is clearly related to the fact that human populations are, at least to a certain degree, geographically structured, and in particular that genetic isolates exist (see Section 2).

3. The common disease - common variant hypothesis (see Article 59, The common disease common variant concept, Volume 2) suggests that only a few susceptibility alleles, at high frequencies in different populations, are responsible for common diseases (Chakravarti, 2001; Reich and Lander, 2001). However, the empirical evidence for this hypothesis is scant (Jobling et al., 2004). Several causal alleles are probably rare and population specific (Kittles and Weiss, 2003; Tishkoff and Kidd, 2004), and can also be embedded in a block structure (the partition of the genome into high-LD regions; see, for example, Daly et al., 2001; see Article 74, Finding and using haplotype blocks in candidate gene association studies, Volume 4) specific to a population (Verhoeven and Simonsen, 2004). Both these factors indicate that LD mapping should not be performed in pooled samples of individuals of different origins, and we expect that separate studies in single populations should be able to identify different loci with higher power.

4. Spurious disease-marker associations are expected if sampling is not stratified by population. If the frequency of a trait (the disease) is higher in a population that is also genetically distinct at some neutral markers, and that population is analyzed jointly with the others (either because sample sizes are limited, or because the populations coexist in the same area), the trait and the markers will appear statistically, but not genetically, associated. For instance, a genome-wide scan carried out in the United States for the genetic determinant of β-thalassemia, a well-understood monogenic disease typical of Mediterranean Europe, might lead to the conclusion that many genes cause the disease. This is because the affected children (most of them from Greek and Italian families) might differ at several genome regions from the controls (representing the general US population).

5. The presence of genetically different groups implies that admixture, if it occurs, can have several consequences. This process was not uncommon during the history and prehistory of our species, for example, between resident hunter-gatherers and immigrating farmers in the Neolithic, or more recently between Native Americans, Africans, and Europeans in the Americas. One practical consequence is that mapping by admixture linkage disequilibrium (MALD; see Article 76, Mapping by admixture linkage disequilibrium (MALD), Volume 4) is possible. This recently resurrected approach (McKeigue, 2005) exploits the transitory LD arising when distinct populations come together (of course, trying to avoid the initially spurious correlations). From the point of view of the admixed population and individuals, genetic admixture produces a beneficial initial effect of heterosis (population-specific recessive deleterious alleles are not expressed in heterozygotes), and in general a decrease in the fraction of individuals affected by recessive diseases (since, when allele frequencies differ between the parental populations, homozygotes are less frequent in the admixed population than the average of the parental populations). On the other hand, outbreeding depression effects, although never demonstrated in humans, are possible in admixed populations. The frequency reduction of alleles positively and differentially selected in different environments (e.g., at the MC1R gene involved in pigmentation, or at the HLA system involved in immune response) may produce an increase of disease incidence in the admixed group. For example, skin cancer or vitamin D deficiency problems can increase when Caucasians and Africans mix, or infectious diseases can spread when populations adapted to different pathogen communities come together. Also, coadapted combinations of alleles at polygenic disease loci might be disrupted by admixture.

6. Exogenous molecules such as drugs must be absorbed and transported before acting, and metabolized and excreted afterwards. Several proteins are involved in these phases, and their gene polymorphisms are related to differential efficiency (e.g., Meyer, 2004). One of the best-known examples is the gene CYP2D6, which encodes an enzyme that metabolizes about 20% of the drugs on the market. Different genetic variants (which also include variation in copy number) have different efficiency, where high and low efficiency result in reduced or toxic effects of the drug, respectively. The development of individual forms of pharmacological treatment based on the individual's genotype is clearly the long-term goal of applied pharmacogenomic studies, but it will not be a possibility in the immediate future. In the short term, it is unclear whether priority in drug treatments could be based on population affiliation, because different studies have yielded very different results regarding the degree of geographical structuring among populations for drug-metabolizing genes (Bradford, 2002; Shimizu et al., 2003).
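The spurious association described in point 4 can be made concrete with a small Python sketch. It builds the expected case/control by carrier/non-carrier tables for two populations in which disease and marker are independent by construction, and then compares the odds ratio within each population with the odds ratio of the pooled sample. All names, sample sizes, disease rates, and carrier frequencies are hypothetical and chosen only to make the effect visible.

```python
def expected_table(n, disease_rate, carrier_freq):
    """Expected 2x2 counts when disease and marker are independent."""
    return {
        "case_carrier": n * disease_rate * carrier_freq,
        "case_noncarrier": n * disease_rate * (1 - carrier_freq),
        "control_carrier": n * (1 - disease_rate) * carrier_freq,
        "control_noncarrier": n * (1 - disease_rate) * (1 - carrier_freq),
    }


def odds_ratio(table):
    """Odds of carrying the marker in cases relative to controls."""
    return (table["case_carrier"] * table["control_noncarrier"]) / (
        table["case_noncarrier"] * table["control_carrier"]
    )


def pool(*tables):
    """Add up the cell counts of several tables, ignoring population labels."""
    return {key: sum(t[key] for t in tables) for key in tables[0]}


# Population A: small, disease common, marker common (hypothetical values).
pop_a = expected_table(n=1000, disease_rate=0.10, carrier_freq=0.80)
# Population B: large, disease rare, marker rare (hypothetical values).
pop_b = expected_table(n=9000, disease_rate=0.01, carrier_freq=0.20)
pooled = pool(pop_a, pop_b)

or_a, or_b, or_pooled = odds_ratio(pop_a), odds_ratio(pop_b), odds_ratio(pooled)
```

Within each population the odds ratio is 1 (no association), while the pooled table gives an odds ratio of about 3.1, produced entirely by the difference in disease and marker frequencies between the two groups. Analyzing each population separately, or adjusting for ancestry, removes the artifact.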

3.2. Evolutionary implications

Adaptive genetic change is the ultimate consequence of environmental pressures affecting the individuals' fitness, but organisms often respond to environmental change by physiological or cultural adaptation, the latter being very important in primates, and especially in humans. Cultural adaptation can allow reproduction of individuals who otherwise would have reduced or zero fitness, and can lead to the accumulation of deleterious or maladaptive mutations. However, (1) fitness is still reduced in individuals affected by several inheritable disorders, or in individuals with maladaptive allelic combinations at important genes such as HLA, especially in underdeveloped countries; (2) only a few genes responsible for local adaptations are known, but we expect that more, still under selection in some human groups, will be discovered in the future; (3) human phenotypes selected in mating, cultural, and social contexts, for example “attractiveness”, “trust”, “cooperation”, “intellectual abilities”, or “novelty seeking”, may have a genetic component (see e.g., Ding et al., 2002; Evans et al., 2004; Kosfeld et al., 2005), and are probably still under selection; (4) evolution is not only natural selection: culture might reduce selective pressures, but might strengthen other evolutionary processes; for example, linguistic barriers decrease the level of gene flow, thus favoring genetic divergence between populations. In addition, massive migration is bringing new alleles into new areas, and causing either admixture or the onset of local reproductive barriers, depending on the circumstances.

It is clear, therefore, that the genetic composition of our species is still subject to evolutionary change, and it makes sense to ask what the predictable evolutionary consequences of the current geographic structure are.

High genetic variation is regarded as a positive asset of a species, both because it is associated with low inbreeding depression effect, and because the evolutionary potential is maintained. Does the global level of genetic variation in our species depend on the fact that we are not a single panmictic group? Surprisingly, population genetics theory has no unequivocal answer to this apparently simple question. Different results are obtained with different assumptions on the population structure model and the demographic parameters, and using different metrics to quantify genetic variation (Slatkin, 1987; Strobeck, 1987; Whitlock and Barton, 1997; Wakeley and Aliacar, 2001). Nobody can tell to what extent human populations were subjected to processes of local extinction and recolonization (that are expected to reduce genetic variation) or alternatively represented rather stable demographic units (whose fragmentation is expected to enhance genetic variation). However, if the extinction of differentiated groups is not frequent, as expected in a recent and expanding species (Foley, 1998), geographic structuring should result in the existence of relatively high genetic divergence. If this prediction is appropriate for humans, the overall geographic structure, limited but significant, positively affected the level of genetic variation in our species. Partly different gene pools would have evolved in diverging groups, and only rarely would such gene pools disappear by population extinction. The expected end result of a process of this kind is a global variation larger than in a panmictic population of the same size.

Be that as it may, the current geographic patterning of genetic variation suggests at least two additional considerations: (1) the widespread presence of clines (CGC), and the consequent absence of major genetic barriers, indicate that a process of speciation, clearly unlikely today, was probably not possible at any time during the history of our species; and (2) the presence of highly differentiated groups, either outliers in a clinal pattern or populations at the extremes of the clines, resulted in the localized occurrence of population-specific alleles. As a consequence, recent secondary contacts might result in an increase of infectious diseases, through a genetic outbreeding depression effect (which can of course affect disease susceptibility as well as many other locally adapted traits) or through the diffusion of pathogens in immunologically naive populations. However, similar to what happened in local breeds of domestic animals or local varieties of plants, differentiated groups might have preserved specific and unique portions of adaptive genetic variation, which might be useful not only in evolutionary but also in medical terms. An example is the small community of Limone sul Garda, where most individuals are genetically protected against heart diseases (Bielicki and Oda, 2002).

Finally, it is interesting to note that genetic and cultural factors can reinforce each other. Linguistic and genetic diversity, for example, appear broadly correlated over much of the planet (Cavalli-Sforza et al., 1988), but this is not due to the existence of genetic factors predisposing people to speak certain languages. On the contrary, there is evidence that language barriers, much like geographic barriers, reduce gene flow, so that the existence of cultural differences results in increased genetic divergence (Barbujani and Sokal, 1990). In a sense, cultural differences (language differences being just the simplest traits to analyze) would create an opportunity for sexual selection to operate (Harpending and Rogers, 2000). Thus, in various contexts, cultural differences might increase further because contacts are limited by differences in traits related to mate choice. If this is true, we expect that the enormous importance of culture in our species could have resulted, in some populations, in a species-specific runaway process of cultural-genetic divergence favoring extreme cultural, and genetic, characteristics.

4. What is the future of human geographic structure?

Modern humans dispersed from Africa into Europe, Asia, and Australia (probably between 60 and 40 ky ago), colonized the Americas (between 15 and 10 ky ago), and more recently (<3.5 ky) reached remote Oceania. All these events, with few exceptions, laid the foundations for the establishment of geographic structure, in populations descended from small numbers of founders and hence highly subject to random drift.

This process of genetic divergence was counterbalanced by continuous local gene flow and by major dispersal processes, among them those associated with the agricultural transition (Bellwood, 2004). In different areas of Eurasia and of the Americas, technologies for food production were developed. The resulting increase in the farmers' population size caused dispersal, which favored genetic mixing with previously settled hunter-gatherers and reduced the effects of drift. Of course, this process could also have increased the genetic divergence between human groups adopting different subsistence strategies, by limiting the levels of gene flow among them. This might have contributed to the genetic isolation of some populations (e.g., Andaman Islanders, Pygmies), but the major effect of farming at the global scale was probably an increase of genetic homogeneity (Excoffier and Schneider, 1999).

In very recent times, the level of population structure probably declined even more, because of the continuous but differential increase of census size (which produced emigration from regions such as Europe in the past or Africa now), the development of means of transportation, and, more generally, socioeconomic changes. In the last few centuries alone, millions of people from Africa, Europe, and America relocated the genomes inherited from their parents to different geographic regions, and a certain level of mixing with the local gene pools occurred. In addition, contacts between populations with locally adapted variation for pathogen resistance (e.g., HLA variants) might have produced global selective sweeps, decreasing population structure at these genes as well. Future advances in the biochemical methods to characterize the genes of past populations (e.g., Vernesi et al., 2004), and in the statistical methods necessary to analyze complex demographic scenarios (e.g., Beaumont et al., 2002; Marjoram et al., 2003), will definitely help us to better understand the past evolutionary processes that led to the current structuring of human populations.

If this reconstruction of the evolution of human population structure is at least approximately accurate, should we simply predict that the “defragmentation” phase that followed the initial “structuring” process will continue? It is hard to imagine what will happen in the future, but a few educated guesses are possible. A first observation is that large numbers of individuals are currently migrating to various areas of the world; however, it is not necessary to assume that large genetic changes will follow. Indeed, most documented historical migration processes that are considered to have contributed to genetically homogenizing human populations occurred at low population densities. That is the case for all the main dispersal processes leading to the farming expansions, as well as for the expansions that occurred in the last few centuries. At the high densities typical of modern urban societies, by contrast, very high numbers of immigrants are necessary to produce significant changes in allele frequencies. In addition, the local effect on population structure will depend not only on the rates of gene flow, but also on the tendency of people to admix once they are settled in a new region. Examples from the past range from extensive admixture in countries such as Brazil (see e.g., Parra et al., 2003) to long-term coexistence of reproductively isolated communities in countries such as India (see e.g., Bamshad et al., 2001) or the US (see e.g., Shriver et al., 1997), with many intermediate situations. In some cases, a high level of structuring within metropolitan areas is to be expected, with ethnic or linguistic communities living within a few hundred meters of each other showing levels of genetic differentiation comparable to those between subcontinents. Cultural barriers, and not mountain chains or seas, might thus lead to the evolution of a secondary phase of structuring in humans.
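The dependence of allele-frequency change on the density of the receiving population can be illustrated with a minimal Python sketch of one generation of gene flow. It assumes complete and random admixture between residents and immigrants, which, as noted above, is often not the case; the function name and all numerical values are hypothetical.

```python
def allele_freq_after_migration(p_resident, p_immigrant, n_residents, n_immigrants):
    """Allele frequency after one episode of gene flow with full admixture.

    The migration rate m is the immigrant share of the mixed population,
    and the new frequency is (1 - m) * p_resident + m * p_immigrant.
    """
    total = n_residents + n_immigrants
    if total <= 0:
        raise ValueError("the mixed population must not be empty")
    m = n_immigrants / total
    return (1 - m) * p_resident + m * p_immigrant


# The same 10 000 immigrants (allele frequency 0.8) arriving in two
# hypothetical resident populations with allele frequency 0.2.
low_density = allele_freq_after_migration(0.2, 0.8, 10_000, 10_000)
high_density = allele_freq_after_migration(0.2, 0.8, 990_000, 10_000)
```

In the sparsely populated area the allele frequency moves from 0.2 to 0.5, whereas in the densely populated one it moves only to about 0.206, which is why the same number of migrants has a much smaller genetic effect in large modern populations.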