The common disease common variant concept (Genetics)

1. Introduction

Unraveling the genetic basis of human diseases represents a major challenge in human genetics. This is especially the case for complex diseases, which comprise the bulk of the disease burden in industrialized societies. Complex genetic disease traits arise as a consequence of genetic and environmental contributions to disease susceptibility (see Article 58, Concept of complex trait genetics, Volume 2). The genetic component is split across many loci, each contributing a small effect to the overall susceptibility. Thus, the identification of the causal genetic variants in these diseases presents a major challenge to human geneticists. The problems of achieving this goal are further magnified by the complexities that result from the gene-gene and gene-environment interactions.

Linkage mapping has been used to great effect with simple Mendelian disease traits, such as cystic fibrosis (Pritchard and Cox, 2002) where genetic variants in a single gene increase the risk of disease dramatically. The nature of complex traits is such that the identification of disease susceptibility genes has relied heavily on association methods. Whether of case-control design, or family-based design, the successful outcome from association studies is dependent on the presence of linkage disequilibrium (LD) in the genome. When a novel variant first appears, it will create a new haplotype, the constituent markers of which will be coinherited for a number of generations. During this stage, it is possible to predict the identity of the other alleles by identifying the allele at one marker on the haplotype. It is this genetic predictability that is exploited in LD mapping, before recombination events occur, which will, over time, cause the overall pattern of LD in this new haplotype to start to break down, thereby reducing the predictable nature of the pattern of inheritance. The pattern of LD across the genome, therefore, has a major impact on the interpretation and design of association studies. The efficacy of an association study to identify a given disease-susceptibility locus is also influenced by the number of alleles at that locus that contribute to disease risk. How many susceptibility alleles there are at any given disease locus depends on their allele frequencies and the allelic heterogeneity. The number of different variants present within a locus that predispose to a disease phenotype may be referred to as its allelic diversity or the allelic spectrum. The term genetic architecture has been used to refer to the range of allelic diversity in the genome that will contribute to a phenotype. Obviously, the genetic architecture will have an impact on the statistical power of mapping studies to detect positive associations. 
In single-gene disorders, the allelic spectrum can be wide, but the penetrance of the alleles is maintained. In polygenic disease traits, the complexity of the allelic spectra and the genetic architecture will influence the feasibility of identifying disease susceptibility alleles. A large number of loci, each with a complex allelic spectrum, would severely compromise the probability of finding susceptibility alleles in polygenic disorders. Success would require both larger study cohorts and very large numbers of polymorphisms to be typed (Wang and Pike, 2004). However, it can be argued from a theoretical standpoint that more common allelic variants may predispose to complex polygenic diseases: the common disease-common variant hypothesis. Preliminary data from association studies in complex traits suggest that commoner variants do make some contribution to disease. The arguments behind the common disease-common variant hypothesis will be described and some examples provided. We will also discuss alternative models and suggest that the true architecture of genetic disease will reflect more than one concept.
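
The breakdown of LD by recombination, described in the introduction above, can be made concrete with a short numerical sketch. It relies on the standard population-genetic result that the LD coefficient D between two loci shrinks by a factor of (1 - r) in each generation of random mating, where r is the recombination fraction between them. The starting value of D and the values of r used below are illustrative assumptions, not figures taken from this article.

```python
def ld_after(d0, r, generations):
    """Expected LD coefficient after a number of generations of random mating.

    Uses the textbook recursion D_t = D_0 * (1 - r)**t, where r is the
    recombination fraction between the two loci.
    """
    return d0 * (1.0 - r) ** generations


d0 = 0.25  # illustrative starting value of D (an assumption)
for r in (0.0001, 0.001, 0.01):  # illustrative recombination fractions
    decayed = [round(ld_after(d0, r, t), 4) for t in (10, 100, 1000)]
    print(f"r = {r}: D after 10, 100, 1000 generations = {decayed}")
```

Under this model, tightly linked markers (small r) keep most of their LD for hundreds of generations, whereas loosely linked markers lose it quickly, which is why marker spacing matters so much in the design of LD mapping studies.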


2. The concept of genetic architecture

When considering the penetrance and number of alleles that predispose to disease, it is useful to consider the influence that quantitative variation in these parameters will exert on the ability to find and identify causal disease polymorphisms (see Article 10, Measuring variation in natural populations: a primer, Volume 1). Genetic disease can be divided into monogenic disorders and polygenic or complex disorders. Monogenic diseases are usually rarer and result from the inheritance of highly penetrant genetic variants at one or a very limited number of loci. More than 1500 monogenic disease genes have been identified, and it is well established that the majority of such diseases exhibit marked allelic diversity (On-line Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). The degree of allelic heterogeneity in any population will reflect a balance between the processes of spontaneous mutation and selection against the inheritance and propagation of any allele. The rate of spontaneous mutation in the genome is highly variable: estimates vary from 10^-4 to 10^-7 per locus per generation (Bellus et al., 1995; Crow, 2000; Eyre-Walker and Keightley, 1999). There is no compelling evidence, however, that simple Mendelian disorders arise in genes that exhibit particularly high mutation rates. In comparison with complex genetic conditions, monogenic diseases are more likely to exert some influence on survival, and hence selection is more likely to operate on monogenic disease than on complex genetic disease. How might the allelic diversity in monogenic disease be explained? In considering such issues, it is imperative to take into account the rapid expansion of the human population over the preceding millennia. 
The exact number of millennia is open to debate; best estimates suggest that this population explosion has occurred over the preceding 700-6000 generations; that is, in the range of 18 to 150 millennia. This impressive demonstration of fecundity has had a major impact on the genetic architecture of the modern human population. The dramatic increase in population size would be expected to increase the numerical representation of disease alleles, even rarer ones, in modern humans. However, two factors will serve to reduce this effect and lead to marked allelic diversity in monogenic disease. First, as new mutations are generated they are less likely to arise on a preexisting disease-associated haplotype, if that haplotype itself is uncommon. Thus, for rarer disease alleles, the accrual of mutations over generations is likely to simply expand the number of disease susceptibility alleles. Second, the presence of selection will serve to reduce the frequency of causal alleles over time.
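
The conversion between generations and millennia quoted above is simple arithmetic. The sketch below assumes a generation time of 25 years; this value is an assumption, chosen because it roughly reproduces the quoted range, and the article does not state which value was used.

```python
GENERATION_YEARS = 25  # assumed generation time; not stated in the article


def generations_to_millennia(generations, years_per_generation=GENERATION_YEARS):
    """Convert a number of generations into thousands of years."""
    return generations * years_per_generation / 1000.0


for g in (700, 6000):
    print(f"{g} generations is about {generations_to_millennia(g)} millennia")
```

With this assumption, 700 generations correspond to 17.5 millennia and 6000 generations to 150 millennia, in line with the range of 18 to 150 millennia given in the text.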

Not all monogenic diseases demonstrate marked allelic diversity. Several factors may explain this situation. The simplest is the case of heterozygous advantage. In this scenario, alleles whose pathologic effect is biased toward recessivity can confer a selective advantage when present as a single copy. There are several well-characterized examples of polymorphisms that exhibit some degree of protection from malaria, for example, G6PD deficiency at the G6PD locus, and HbC and HbS at the HBB locus in West African populations; the estimated allele frequencies are 0.20, 0.09, and 0.10, respectively. Selection may operate during the phase of population expansion, and it may increase the frequency of certain alleles in the population in the preexpansion phase. For example, the mutations that give rise to cystic fibrosis (CF) do not immediately appear to comply with the arguments set out above. More than 900 alleles of the CF transmembrane conductance regulator (CFTR) gene have been associated with CF; however, approximately 70% of cases are due to one single deletion, ΔF508. It has been postulated that heterozygotes carrying CFTR mutations may have some selective resistance to Salmonella typhi (Pier et al., 1998). Irrespective of the mechanism, it can be shown that for an allele as frequent as ΔF508 in the preexpanded population, this simple ancestral spectrum will persist following expansion with a half-life of 39 000 to 390 000 years, depending on the assumed mutation rate (Reich and Lander, 2001).
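
Heterozygous advantage can hold a deleterious allele at a stable intermediate frequency. A minimal sketch, assuming the textbook overdominance model in which the two homozygotes have fitness 1 - s and 1 - t relative to the heterozygote: the disease allele then settles at a frequency of s / (s + t). The selection coefficients below are hypothetical and serve only to show what size of heterozygote advantage would be consistent with an allele frequency of 0.10, such as that quoted for HbS.

```python
def equilibrium_frequency(s, t):
    """Equilibrium frequency of allele a under overdominance.

    Assumed fitnesses: AA = 1 - s, Aa = 1, aa = 1 - t.
    Allele a is then maintained at s / (s + t).
    """
    return s / (s + t)


def implied_advantage(q, t):
    """Selection coefficient s against AA needed to hold allele a at frequency q."""
    return q * t / (1.0 - q)


t = 1.0  # hypothetical: homozygotes for the disease allele leave no offspring
s = implied_advantage(0.10, t)
print(f"selection against the normal homozygote needed: s = {s:.3f}")
print(f"equilibrium frequency recovered: {equilibrium_frequency(s, t):.3f}")
```

The point of the sketch is that a fairly modest disadvantage of the normal homozygote, relative to the carrier, is enough in this model to keep a severely deleterious recessive allele common.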

Do the effects of recent population expansion differ in complex traits when compared with monogenic disorders? If one assumes that the frequency of the disease susceptibility alleles in more common complex genetic traits is higher than in rare monogenic disease, then essentially the answer is yes. And not only is the answer in the affirmative, but it was shown by Reich and Lander that this situation will remain so for tens if not hundreds of thousands of years, depending on the frequency of the disease allele. Unsurprisingly, more common alleles persist at a similar frequency in an expanded population for longer. The implications of this model have given rise to the common disease-common variant hypothesis.

3. Current hypotheses to explain genetic architecture

3.1. Common Disease Common Variant (CDCV) hypothesis

The CDCV, or interactive, model of genetic architecture was first proposed in 2001 (Reich and Lander, 2001). It states that complex diseases are caused by the interaction of common alleles at a small group of susceptibility loci (Smith and Lusis, 2002; see also Article 57, Genetics of complex diseases: lessons from type 2 diabetes, Volume 2). These common alleles are not population specific, but are present at >1% minor allele frequency in multiple populations.

By taking into account the rapid, recent expansion of the human population, as described above, it is apparent that the allelic spectrum at all loci will gradually increase with time. However, the frequency of the allele in the preexpansion phase is critical to the rate at which allelic diversity is generated. Mathematically, the main influences on the allelic spectrum, in the absence of selection, can be summarized as a balance between the forward mutation rate, 4N_e μ_s, and the reverse mutation rate, 4N_e μ_n, where N_e is the effective population size (usually estimated to be 10 000) and μ_s is the mutation rate per generation per locus, the n subscript denoting the reverse mutation rate. That is, in any stable population, where the genetic variants are in equilibrium, there is a balance between the rate at which new mutations arise, which for most genes is approximately 1 × 10^-5 to 1 × 10^-6 per gene per generation (Bellus et al., 1995; Crow, 2000; Eyre-Walker and Keightley, 1999; Peltomaki, 2001; Sankaranarayanan, 1998), and their loss by negative selection. However, in a rapidly expanded population, rare alleles in the original founder pool are reduced in frequency as the result of strong selective pressure, and will become swamped by the generation of new alleles (Reich and Lander, 2001). By contrast, common alleles, which have a large population reservoir and are under little selective pressure, have a lower turnover, so will take longer to be diluted out (diversified) by the presence of these new mutations (Smith and Lusis, 2002). New disease-susceptibility mutations might arise but they would still be carried on the ancestral haplotype. The higher the initial allele frequency, the longer this will take (Wright and Hastie, 2001). Allele diversification will continue in this way, until the alleles reach equilibrium in the expanded population. 
Reich and Lander state that the diversification of the alleles in the expanded human population is in an intermediate stage, so that these selectively neutral common alleles that were important markers of disease in the preexpanded population are predicted to have retained a low level of allelic diversity in the postexpansion human population. Therefore, they can still be useful markers of common disease in the modern human population (Smith and Lusis, 2002). These concepts are illustrated in Figure 1.
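
The population-scaled mutation rate mentioned above (four times the effective population size times the per-locus mutation rate) can be evaluated directly from the figures quoted in the text. This is plain arithmetic on the article's own numbers, an effective population size of 10 000 and mutation rates between 10^-5 and 10^-6; the function and variable names are illustrative.

```python
def scaled_mutation_rate(effective_size, mutation_rate):
    """Population-scaled mutation rate, 4 * N_e * mu."""
    return 4 * effective_size * mutation_rate


N_E = 10_000  # effective population size quoted in the text

for mu in (1e-5, 1e-6):  # per-gene, per-generation rates quoted in the text
    print(f"mu = {mu:g}: 4 * N_e * mu = {scaled_mutation_rate(N_E, mu):.2f}")
```

Both values are well below 1, which is consistent with the argument that, at the ancestral population size, new mutations entered a locus slowly relative to the size of the existing allele pool.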

Another important factor in formulating the CDCV hypothesis is the strength of the selection pressure on allele diversification. For alleles to attain a high equilibrium frequency in a population, that is, to retain low diversity and be useful markers in LD mapping of complex diseases, they must be selectively neutral or be under low selective pressure. This leads to a lower turnover in the population and decreases the effect of new mutations. However, selectively neutral alleles are likely to make a small contribution to the overall disease risk, giving weak associations to disease, which will be another factor in the requirement for large sample sizes. This might be the case for alleles implicated in late-onset disorders such as type 2 diabetes (T2D) or hypertension (Wright and Hastie, 2001). Furthermore, it is necessary to take into account stratification issues when interpreting association study results. An allele may be at a high frequency in a subpopulation, due to urbanization or geographical isolation, so a positive or negative association may not be applicable to the population as a whole. The influence of the rapid expansion of human populations may also be seen in relatively isolated populations. In such circumstances, there may be a marked influence of the chromosomal composition of the founders of that population. Indeed, as can be seen in Figure 1, if this “founding” effect occurred relatively recently, then even low-frequency disease-susceptibility alleles may be disproportionately represented in the contemporary population.

Figure 1 Taken from the paper by Reich and Lander (2001). The figure illustrates how the frequency of the disease-susceptibility alleles, f_0, changes over time during the human population expansion. Panel (a) shows how the fraction of the total proportion of disease-susceptibility alleles changes with time. The curves are calculated assuming a population increase from 10^5 to 6 × 10^9 and a fixed mutation rate (μ) of 3.2 × 10^-6 per generation. As is shown, more frequent classes of disease alleles will have persisted over the preceding 100 000 years and shown only a slight reduction in frequency. The same point is illustrated in (b), which shows the time taken for the original disease allele class frequency to fall by half. Thus, for a disease allele frequency of 0.005, the frequency will have fallen to 0.0025 after approximately 20 000 years. It can also be seen that at higher mutation rates the half-life is substantially shorter. The probability that the two alleles randomly picked both belong to the disease-susceptibility class is denoted by φ_disease. Hence, the reciprocal of this term is an index of the allelic spectrum of any disease. In (c), an arbitrary threshold for simpler disease spectra is shown by a cutoff of 1/φ_disease = 10. The simpler disease spectra are maintained for at least 100 000 years with disease allele frequencies between 0.01 and 0.05.
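
The reciprocal index mentioned in the caption can be illustrated with a small sketch. It assumes that the probability in question is read as the usual homozygosity-style measure, that is, the chance that two alleles drawn at random from the disease class are the same allele, so that its reciprocal is an effective number of alleles. That reading, the function name, and the two example spectra are assumptions made for illustration; they are not data from Reich and Lander.

```python
def effective_number_of_alleles(frequencies):
    """Reciprocal of the probability that two random draws are the same allele.

    `frequencies` are the relative frequencies of the alleles within the
    disease class; they are normalized here so they need not sum to 1.
    """
    total = sum(frequencies)
    identity = sum((f / total) ** 2 for f in frequencies)
    return 1.0 / identity


# Hypothetical simple spectrum: one predominant allele plus 30 rare ones.
simple = [0.70] + [0.01] * 30
# Hypothetical diverse spectrum: 100 equally frequent alleles.
diverse = [1.0] * 100

for name, spectrum in (("simple", simple), ("diverse", diverse)):
    n_eff = effective_number_of_alleles(spectrum)
    verdict = "below" if n_eff < 10 else "above"
    print(f"{name}: effective number of alleles = {n_eff:.1f} ({verdict} the cutoff of 10)")
```

A spectrum dominated by a single allele scores close to the low end of the index even when many rare alleles exist, which is the sense in which such a spectrum counts as "simple" under the cutoff of 10.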


The CDCV model predicts that there is an increased risk for common diseases from common variants of low individual risk, across multiple populations. Three examples of common alleles support this hypothesis. First, a 32-bp deletion in the coding region of the CCR5 gene, which decreases the transmission of HIV-1, has an allele frequency of at least 9% in Caucasians (Wright and Hastie, 2001) and is common in a number of other populations (Lu et al., 1999). Second, the E4 allele of apolipoprotein E (APOE), which has a prevalence of 15% in Caucasians, over 20% in Africa and Scandinavia, and 6-12% in Japan and China (Ritchie and Dupuy, 1999), has been associated with Alzheimer’s disease (Saunders et al., 1993) and coronary artery disease. Third, a common variant, P12A at the PPARγ locus, is associated with type 2 diabetes (Altshuler et al., 2000); the causal allele is at high frequency in the population (0.85) and, as expected, the risk associated with it is small. The genetic basis of complex disease is insufficiently established, however, to provide full empirical support to the CDCV hypothesis. It should also be appreciated that association studies have a bias toward identifying more frequent variants.
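
Why a common allele of small individual risk can still matter at the population level can be shown with a short calculation. The sketch below assumes Hardy-Weinberg proportions to convert an allele frequency into a carrier fraction, and uses Levin's formula for the population attributable fraction. The relative risk of 1.25 is a hypothetical value chosen for illustration; the article states only that the risk associated with the PPARγ allele is small.

```python
def carrier_fraction(allele_freq):
    """Fraction of people carrying at least one copy of the allele,
    assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - allele_freq) ** 2


def attributable_fraction(exposed_fraction, relative_risk):
    """Levin's population attributable fraction for a single risk factor."""
    excess = exposed_fraction * (relative_risk - 1.0)
    return excess / (1.0 + excess)


HYPOTHETICAL_RR = 1.25  # assumed modest relative risk, not taken from the article

for freq in (0.01, 0.10, 0.85):
    carriers = carrier_fraction(freq)
    paf = attributable_fraction(carriers, HYPOTHETICAL_RR)
    print(f"allele frequency {freq:.2f}: carriers {carriers:.3f}, "
          f"attributable fraction {paf:.3f}")
```

For a fixed modest relative risk, the attributable fraction rises with allele frequency, so a weak but very common risk allele can account for a larger share of cases than a rare allele of the same effect.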

3.2. Multilocus-multiallele hypothesis

The CDCV hypothesis does not, however, provide a complete account of the genetic architecture of human disease. Whereas the CDCV hypothesis states that common diseases are due to common alleles, the multilocus-multiallele (MLMA), or genetic heterogeneity, hypothesis holds that an affected individual has a low probability of carrying any given susceptibility allele, because these diseases receive contributions from multiple susceptibility alleles and from environmental influences. The result is a low chance, or low detectance, of detecting a single deleterious allele in an association study (Weiss and Terwilliger, 2000).

Furthermore, the high frequency of some complex diseases could be due to the prevalence of environmental triggers that form part of the natural aging process. For some complex diseases, for example hypertension, there is a high-level contribution of environmental/natural aging factors implicated in the disease process, since the heritability decreases with increasing age of onset, due to an increasing contribution from the aging process and other environmental factors.

One more factor supporting the MLMA hypothesis is the prediction that low-frequency variants, rather than the high-frequency variants proposed by the CDCV hypothesis, have a role to play in the genetics of a disease. This prediction arises from the inverse relationship between allele frequency and genetic effect (Morton, 1996; Wright, 1968): the lower the allele frequency, the higher the genetic effect. Breast cancer and familial hypercholesterolemia (FH) are two examples that support the idea of rare alleles being important risk factors for disease. In the BRCA2 gene, there are more than 404 rare alleles with increased risk of disease, but only 1 out of 6 common alleles has any effect. Likewise, in FH, there is an increased risk of premature coronary artery disease attributed to more than 435 rare alleles in the LDL receptor, but no known common alleles (Wright and Hastie, 2001).

These examples also serve to raise the question over what proportion of weakly associated, selectively neutral common alleles actually have a role to play in the genetic aetiology of complex diseases. Since it has been proposed that neutral alleles either tend to be lost to a population or be fixed in that population (Pritchard and Przeworski, 2001) and common alleles are thought to attain their high-allele frequency by virtue of being selectively neutral, they are unlikely to have a significant effect on risk of disease (Pritchard and Przeworski, 2001).

3.3. Selection in complex traits

As may be predicted by the CDCV hypothesis, the E4 allele of APOE appears to be the ancestral allele (Fullerton et al., 2000) and is not uncommon in the population. The situations at the PPARγ locus and in the VNTR alleles at the INS (insulin) locus are analogous. Why should disease-susceptibility alleles be so prevalent in the population? The model of recent population expansion predicts their high frequency in the ancestral population. It could be argued that their individual disease risk is small and that complex disease traits often pertain to diseases of adult onset, which may impact less on selection. It is also possible that there are some selective advantages to alleles that predispose to complex traits (Fullerton et al., 2000). It was suggested more than 40 years ago that diabetes mellitus might be so common in the modern world as a consequence of the selection of alleles conferring metabolic efficiency on the bearer (Neel, 1962). Similarly, it can be argued that autoimmunity may arise as a consequence of the selective pressures imposed by infection.

3.4. Common variant/multiple disease hypothesis

The CDCV hypothesis also lends itself to extension, raising the possibility that common disease variants may predispose to multiple diseases. That is, if a restricted number of common genetic variants predispose to a disease state, would this mechanism be disease specific? The overlap between many complex disease traits is well established at an epidemiological level, although often less well understood pathologically. It would not be surprising to learn that coronary artery disease (see Article 63, Hypertension genetics: under pressure, Volume 2), obesity, and type 2 diabetes shared some common aetiology. Multiple autoimmune diseases appear to be more prevalent in some families (Broadley et al., 2000). Indeed, there is genetic evidence to support the concept of shared genetic risk factors in different, but related disease states. For example, it has been shown in autoimmune diseases that the loci mapped in genome-wide linkage studies tend to overlap more than would be anticipated by chance. This observation has also been made in animal models of autoimmunity (Becker, 2004). Similar observations have been made in schizophrenia and in bipolar disorder, and in type 2 diabetes and obesity. Linkage analysis, however, defines large genomic regions that do not necessarily harbor identical disease alleles, and the lack of well-substantiated disease associations in complex traits does not allow clear support of commonality of genetic factors. Nevertheless, there are some tantalizing clues. The E4 allele of APOE has been cited above in support of the CDCV hypothesis; its association with both Alzheimer’s disease and coronary artery disease suggests the possibility of shared genetic factors, a conclusion strengthened by the association of E4 with other related disease states (Smith, 2000). Similarly, CTLA4 polymorphisms have been reported to be associated with multiple autoimmune diseases including type 1 diabetes, multiple sclerosis, and autoimmune thyroid disease. 
The heterogeneity of the published data and the complexity of CTLA4 haplotypes are such that it is unclear at present whether these CTLA4 associations represent the influence of a common or restricted number of causal variants.

4. Conclusions

Much of the progress in LD mapping of complex diseases has been made using the major assumption of the CDCV hypothesis, that is, that common alleles cause common diseases. However, the results are often inconclusive, with reports of negative or borderline associations, and of positive associations that are frequently population dependent. There are many reasons for these findings. There is probably an overall lack of power in the individual studies, so that association is biased toward the identification of more common, but weaker haplotypes. However, the failure to find widespread, reproducible association using LD mapping is not a reason to discard the CDCV model as a working hypothesis. The CDCV model does not have to be universally applicable to justify its use in LD mapping studies (Goldstein and Chikhi, 2002). However, the results of such studies need to be interpreted with caution. If an LD-based mapping study gives negative results, that is, no common alleles are associated with disease, then it is worth going back to the data and looking for rare variants. For positive associations with common alleles, it will be necessary first to replicate the results in an independent sample set and then to look for rarer SNPs, with potentially greater penetrance, which may have been overlooked in the initial study. It is likely that the true genetic architecture includes a combination of both rare variants and common alleles, potentially in the same locus, making a genetic contribution to disease, since that is the pattern of allele frequencies across the rest of the genome (Wang and Pike, 2004). Furthermore, it seems unlikely that multiple common alleles could interact with one or more common environmental triggers to cause common diseases without these disorders being even more prevalent. 
One solution to this problem is that common diseases require an additional contribution from rarer, more penetrant alleles, with increased disease risk to “tip the balance”. For the future, the full potential of the CDCV hypothesis and LD mapping is unlikely to be realized unless the contribution of rare alleles is assessed through fine mapping in large collaborative projects.
