Haplotype mapping (Genomics)

1. Introduction: linkage, association, and linkage disequilibrium (LD) mapping

Linkage mapping is based on coinheritance of large chromosomal stretches among members of a family sharing a common phenotype. This approach has been very successful for rare Mendelian diseases, but it has led to mixed results for common diseases (Lander and Kruglyak, 1995). Many factors, such as low penetrance, small sample size, differing diagnostic or disease definitions, and uncertainty in the position of the linkage peak, have contributed to the low success of family-based genome scans. Association approaches can directly test functional polymorphisms or indirectly test mutations using genetic markers in linkage disequilibrium (LD) with the mutation. Association and LD mapping traditionally compare samples of cases and controls, but alternative strategies exist, such as comparing the alleles transmitted to an affected offspring with the untransmitted alleles of the parents. Association methods have greater power than linkage analysis to detect small and moderate genetic effects and are thus better suited to identifying variants predisposing to common diseases (Risch and Merikangas, 1996). However, it is not currently feasible to perform association scans at the genome scale, and a mixed approach is typically used (see Table 1; see also Article 11, Mapping complex disease phenotypes, Volume 3 and Article 17, Linkage disequilibrium and whole-genome association studies, Volume 3 for more details on linkage and association mapping of common diseases).


While microsatellites were the markers of choice for linkage studies because they are highly informative, they cannot be used reliably for association studies because of their relatively high mutation rates. Single nucleotide polymorphisms (SNPs) are more stable and more abundant than microsatellite markers, enabling investigators to cover any genomic region extensively (Risch, 2000). There are an estimated 10 to 15 million common SNPs with a minor allele frequency greater than 1% in the human population (Kruglyak and Nickerson, 2001). Using population simulations to estimate LD, Kruglyak (1999) hypothesized that useful levels of LD for association studies in the general outbred population extend for about 3 kb, implying that about 500 000 SNPs would be needed for a genome-wide association scan. Even with today's cheapest technology, it would be infeasible to type them all in a typical case-control study. However, the rate of LD decay is not uniform throughout the genome. Several studies observed that LD does not decay continuously and can sometimes extend over very long distances, up to 1-3 cM (Lonjou et al., 1999; Mohlke et al., 2001; Service et al., 2001). LD is known to extend for longer distances in isolated populations (such as populations from Iceland and Finland) or founder populations that experienced a rapid expansion (such as the Amish, or populations from Quebec and Newfoundland), and one might expect this property to facilitate disease mapping (reviewed in Peltonen, 2000). While isolated or founder populations have helped in the identification of single mutations for rare diseases because they reduce allelic heterogeneity, their usefulness for common diseases is still debated (see Kruglyak, 1999).

Table 1 Examples of haplotype mapping studies

| Disease | No. of cases/families (population) | No. of SNPs (density) | Gene/haplotype identified | Region | Haplotype frequency | Replication panel | References | Replication in independent study |
|---|---|---|---|---|---|---|---|---|
| Type 2 diabetes | 110 cases (Mexican Americans) | 114 (1/15 kb) | CAPN10 | 2q37 | 0.23/0.32 | 192 cases (Finnish) | Horikawa et al. 2000 | Meta-analysis of 17 studies (Weedon et al. 2003) |
| Crohn disease | 139 trios (Canadian) | 301 (1/3.3 kb) | 250 kb haplotype in the cytokine cluster | 5q31 | 0.37 | 88 trios (Quebec) | Rioux et al. 2001 | 368 German trios (Giallourakis et al. 2003) |
| Asthma | 130 cases (UK/US) | 135 (1/18.5 kb) | ADAM33 | 20p13 (23 genes) | 0.6 | 460 families (UK/US) | Van Eerdewegh et al. 2002 | Werner et al. 2004 |
| Schizophrenia | 476 cases (Icelandic) | 181 (1/8.3 kb) | NRG1 | 8p12-p21 | 0.075 | | Stefansson et al. 2002 | 609 Scottish cases (Stefansson et al. 2003) |
| Asthma | 80 nuclear families (Australian) | 54 (1/14 kb) | PHF11 | 13q14 | 0.08/0.4 | 237 cases (British) | Zhang et al. 2003 | |
| Asthma | 244 families (Australian/British) | 82 (1/5.6 kb) | DPP10 | 2q14 | 0.3 | 270 cases (British) | Allen et al. 2003 | |
| Susceptibility to leprosy | 197 trios (Vietnamese) | 81 (1/6 kb) | PARK2/PACRG | 6q | 0.31 | 587 cases (Brazilian) | Mira et al. 2004 | |
| Myocardial infarction | 779 cases (Icelandic) | 48 (1/1.5 kb) | ALOX5AP | 13q12-13 | 0.095 | 753 cases (British) | Helgadottir et al. 2004 | |

2. Haplotype blocks

Recent studies using relatively high densities of SNPs enabled the analysis of LD structure at smaller scales. It was observed that LD decays in sudden discrete drops to form blocklike structures that can extend from a few kilobase pairs to more than 100 kb (Daly et al., 2001; Reich et al., 2001; Patil et al., 2001). Moreover, these blocks seem to have very low diversity, suggesting that most of this diversity comes from a small number of ancestral chromosomes. The specific set of alleles on a chromosomal segment is called a haplotype. New haplotypes can arise from new mutations or recombination. Each novel mutation is initially associated with the alleles that happened to be present on the particular chromosome on which it arose (thus establishing a novel unique haplotype). These alleles will remain correlated until recombination events occur between them. The existence of recombination hot spots was suggested as an explanation for the presence of these large blocks of low haplotype diversity, or haplotype blocks (Daly et al., 2001). Low haplotype diversity implies that only a few SNPs per haplotype block would need to be typed in order to extrapolate most of the information of the remaining SNPs, which could save considerable time and money.

For example, Rioux et al. (2001) applied an LD approach to the investigation of a Crohn disease susceptibility gene linked to chromosome 5q31. In a region spanning 250 kb and encompassing several cytokine genes, they observed more than a dozen SNPs associated with the disease. Daly et al. (2001) showed that the region can be subdivided into 11 haplotype blocks, each with only two to four haplotypes describing more than 90% of haplotype diversity at this chromosomal region (Figure 1). Had the underlying block structure been previously known, they could have genotyped as few as 20 SNPs (termed haplotype tag SNPs or htSNPs) to test the entire 250 kb. One risk-associated haplotype was identified, but because of the existence of large haplotype blocks and the presence of considerable LD between them, the causative polymorphism (or even the causative gene) cannot be identified using genetic mapping alone. In order to identify which gene or which polymorphism is contributing to the disease, functional studies, such as expression analysis, targeted mutations in vitro, or knockout studies in mice, are necessary. Using a different population might also be helpful. Oksenberg et al. (2004) used a cohort of African Americans to define a haplotype in the HLA region that is associated with multiple sclerosis. The African haplotypes in this region do not display the high degree of LD and haplotype structure observed in Caucasians. This enabled the research team to reduce the region to only one gene, DRB1, excluding DQB1. The fact that many disease-associated SNPs located in the vicinity of one or more genes can be present within a haplotype block, and that LD can be detected across blocks, clearly illustrates the importance of understanding the underlying haplotype structure of a region in the population studied.
Using Alzheimer’s disease as an example, several studies showed that the association between the ApoE gene and the disease could be identified by using haplotype information even though the true variant was not genotyped (Martin et al., 2000, Fallin et al., 2001).

Figure 1 Haplotype structure at the IBD5 locus. (a) Common haplotype patterns in each block of low diversity. Dashed lines indicate locations where more than 2% of all chromosomes are observed to transition from one common haplotype to a different one. (b) Percentage of observed chromosomes that match one of the common patterns exactly. In addition, four markers fell between blocks, which suggests that the recombinational clustering may not take place at a specific base-pair position but rather in small regions.

Different analytical approaches can be used to construct a comprehensive LD/haplotype map of a selected region. Several methods have been described to identify haplotype blocks and are explained in more detail elsewhere (see Article 74, Finding and using haplotype blocks in candidate gene association studies, Volume 4). A color-coded pairwise D’ map, constructed with programs such as GOLD (Abecasis and Cookson, 2000), is also very useful to visually examine regions of high and low LD (see Article 73, Creating LD maps of the genome, Volume 4). Finally, association can be measured using multilocus haplotypes instead of individual SNPs, resulting in an increase in power to detect indirect association and in a decrease in the number of tests performed, thus minimizing the correction applied to the p-value (Zhang et al., 2002a). Several studies have successfully mapped an associated gene or haplotype block to a common trait by using a dense set of SNPs to understand the haplotype structure. Some examples are listed in Table 1.
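The pairwise LD statistics that underlie such maps can be illustrated with a minimal Python sketch. It assumes phased haplotype counts for two biallelic SNPs are already available; the function name `ld_measures` and the toy counts are hypothetical, and this is not the implementation used by GOLD or any other program. It uses the standard definitions D = p(AB) - p(A)p(B), D’ = |D|/Dmax, and r2 = D^2 / [p(A)(1 - p(A))p(B)(1 - p(B))].

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from phased haplotype counts for two biallelic SNPs.

    Alleles at the first SNP are A/a and at the second SNP B/b.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n          # frequency of allele A
    p_B = (n_AB + n_aB) / n          # frequency of allele B
    d = p_AB - p_A * p_B             # raw disequilibrium coefficient

    # Largest |D| attainable given the allele frequencies
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Toy example: only three of the four possible haplotypes are observed,
# so D' is maximal even though the two SNPs are far from interchangeable.
d, d_prime, r2 = ld_measures(n_AB=50, n_Ab=0, n_aB=25, n_ab=25)
print(f"D={d:.3f}  D'={d_prime:.2f}  r2={r2:.2f}")
```

The toy counts show why the two measures are used for different purposes: D’ reflects the absence of one haplotype (no evidence of historical recombination), whereas r2 reflects how well one SNP predicts the other.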

Although recombination hot spots have been observed directly in other species (yeast, Drosophila, mouse), doing so in humans is labor-intensive. Jeffreys et al. (2001) used sperm typing to directly observe meiotic events and infer the existence of recombination hot spots in humans. They observed that 94% of the crossovers occurred at only six positions in the 300 kb HLA region on chromosome 6p. These recombination hot spots are all between 1 and 2 kb in length. In yeast, hot spots tend to be associated with GC-rich chromosomal regions and form open domains in meiotic chromatin that allow access to the recombination machinery (reviewed in Petes, 2001). In humans, however, no feature of the primary genomic sequence has yet been found to be predictive of the hot spots. Recombination hot spots have also been observed in the pseudoautosomal region of chromosome Y (Lien et al., 2000; May et al., 2002) and in the β-globin gene cluster (Schneider et al., 2002; Smith et al., 1998). It is still unknown whether these observations are representative of recombination events in the rest of the genome and in all populations. Kauppi et al. (2003) showed that although haplotype diversity in the HLA region differed between European and African populations, patterns of LD were similar. We note, however, that the average recombination rate is 1.65 times higher in females than in males, indicating that the mechanisms involved are not homogeneous (Kong et al., 2002).

Gabriel et al. (2002) extensively genotyped 50 autosomal regions spanning 13 Mb of sequence and found that, on average, haplotype blocks extend for 22 kb in Asian and European populations and 11 kb in the Yoruba population from Nigeria. Using one marker every 7.8 kb, they observed that about 50% of the sequence was in blocks. With denser sets of SNPs, the fraction of sequence included in blocks will increase, but the average block size will drop as additional, smaller blocks are identified (Ke et al., 2004). Gabriel et al. (2002) defined blocks as regions in which 95% of the marker pairs show no strong evidence of historical recombination, the latter being based on confidence intervals for D’ values. This accounts for some uncertainties, such as finite sample size and unphased diploid data. They observed that block boundaries are extremely similar between populations. This is in agreement with the recombination hot spot theory, and it suggests that a blocklike model can be applied to the whole genome and to other populations as well. Other studies interpreted similar results differently, arguing that the majority of the blocklike patterns could be explained by population history (Patil et al., 2001; Phillips et al., 2003). However, the presence of the longest blocks had to be explained by some other mechanism, such as recombination hot spots. Both factors, population history and recombination hot spots, clearly play a role in shaping the genome, but their relative contributions are still debated today.
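The 95% criterion described above can be sketched in a few lines of Python. This is a simplified illustration, not the published algorithm: it assumes each marker pair has already been classified (for example from confidence intervals on D’) as showing or not showing strong evidence of historical recombination, and the names `is_block` and `greedy_blocks`, as well as the left-to-right greedy scan, are assumptions made for the example.

```python
from itertools import combinations


def is_block(markers, recombinant_pairs, min_fraction=0.95):
    """True if at least min_fraction of marker pairs in `markers` show
    no strong evidence of historical recombination."""
    pairs = list(combinations(markers, 2))
    if not pairs:
        return False
    clean = sum(1 for p in pairs if frozenset(p) not in recombinant_pairs)
    return clean / len(pairs) >= min_fraction


def greedy_blocks(n_markers, recombinant_pairs, min_fraction=0.95):
    """Scan markers left to right, extending each block while it qualifies.

    Returns a list of (first_marker, last_marker) index pairs.
    """
    blocks, start = [], 0
    while start < n_markers - 1:
        end = start + 1
        if not is_block(range(start, end + 1), recombinant_pairs, min_fraction):
            start += 1
            continue
        while end + 1 < n_markers and is_block(
            range(start, end + 2), recombinant_pairs, min_fraction
        ):
            end += 1
        blocks.append((start, end))
        start = end + 1
    return blocks


# Toy data (hypothetical): markers 0-2 and 3-5 form two groups, and every
# pair spanning the two groups shows evidence of recombination.
recombinant = {frozenset((i, j)) for i in range(3) for j in range(3, 6)}
print(greedy_blocks(6, recombinant))
```

Because the result depends on which pairs are informative and on how the interval is extended, different scanning strategies can place boundaries differently, which is consistent with the sensitivity to block definition discussed in the text.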

Many different definitions of haplotype blocks have been published, and it is known that allelic frequency and SNP density affect all block definitions to some degree (Zondervan and Cardon, 2004; Ke et al., 2004; Schulze et al., 2004). Denser genotyping of SNPs will reveal more blocks that are shorter in length, while low allelic frequency SNPs will tend to break existing blocks into shorter ones. Also, since recombination is not confined to recombination hot spots (i.e., LD can exist between adjacent blocks and recombination can occur inside blocks), block boundaries will vary between definitions. For these reasons, it is not yet clear how to compare haplotype blocks between studies or whether the results reflect underlying biological processes. For a review of the methods to identify and define haplotype blocks, see Article 74, Finding and using haplotype blocks in candidate gene association studies, Volume 4. Because the Gabriel et al. (2002) approach is somewhat more easily understood by biologists and does not need phased haplotype data, it has been the most widely used to date.

3. Building a haplotype map

Following these pilot studies, it became apparent that a human haplotype map describing the 80-90% of the genome that is included in blocks, and representing most of human diversity, would be extremely valuable. Such a map would provide a valuable tool to investigators wanting to identify genes involved in common diseases. One would then have to select only a few SNPs in each block in a candidate region without losing much power in an association study (see Figure 2). It is estimated that this map would reduce by 10- to 30-fold the number of SNPs necessary for any association study targeting common risk alleles. For whole-genome association studies, this may involve testing approximately 300 000 to 1 million SNPs, instead of the 10 to 15 million common SNPs in the human genome. The International HapMap Consortium, composed of teams from Canada, the United States, the United Kingdom, Nigeria, China, and Japan, was established in 2002 to build a map that will be useful for association mapping in any human population. Details of the approach were published in the first year of the project (International HapMap Consortium, 2003). As discussed above, although block boundaries are similar between populations of various ancestries, many differences, notably in block length and haplotype frequency, can be observed. Thus, four distinct populations (of European, Chinese, Japanese, and Yoruba ancestry) were initially chosen for inclusion in the HapMap project. These samples are thought to include a significant amount of the overall human genetic variation, and thus the results should be applicable to most association studies. The human haplotype map should be completed by the end of 2005, but genotypes are made available to all researchers on the web as soon as they are produced. As of May 2004, more than 600 000 SNPs had been genotyped on HapMap samples.

Figure 2 SNPs, haplotypes, and tag SNPs. (a) SNPs. Shown is a short stretch of DNA from four versions of the same chromosome region in different people. Most of the DNA sequence is identical in these chromosomes, but three bases are shown where variation occurs. Each SNP has two possible alleles; the first SNP in panel a has the alleles C and T. (b) Haplotypes. A haplotype is made up of a particular combination of alleles at nearby SNPs. Shown here are the observed genotypes for 20 SNPs that extend across 6000 bases of DNA. Only the variable bases are shown, including the three SNPs that are shown in panel a. For this region, most of the chromosomes in a population survey turn out to have haplotypes 1-4. (c) Tag SNPs. Genotyping just the three tag SNPs out of the 20 SNPs is sufficient to identify these four haplotypes uniquely. For instance, if a particular chromosome has the pattern A-T-C at these three tag SNPs, this pattern matches the pattern determined for haplotype 1. Note that many chromosomes carry the common haplotypes in the population.
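The tag SNP idea in panel (c) can be sketched as a small search problem: find the smallest set of SNP positions whose alleles distinguish every common haplotype. The following Python sketch uses an exhaustive search, which is only practical for short toy haplotypes; the function name `minimal_tag_snps` and the five-SNP haplotypes are hypothetical and are not the data shown in the figure.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of SNP indices that uniquely identifies
    each distinct haplotype (exhaustive search, toy-sized input only)."""
    distinct = set(haplotypes)
    n_snps = len(haplotypes[0])
    for k in range(1, n_snps + 1):
        for cols in combinations(range(n_snps), k):
            # Alleles seen at the candidate tag positions, per haplotype
            patterns = {tuple(h[c] for c in cols) for h in distinct}
            if len(patterns) == len(distinct):
                return cols
    return tuple(range(n_snps))


# Four hypothetical common haplotypes over five SNPs
common_haplotypes = ["CTAAG", "CTGAG", "TCACG", "TCGCA"]
tags = minimal_tag_snps(common_haplotypes)
print(tags)
for h in common_haplotypes:
    print(h, "->", "".join(h[c] for c in tags))
```

In this toy case several SNPs carry redundant information (the first, second, and fourth always change together), so typing one of them plus the third SNP is enough to tell the four haplotypes apart.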

Despite the initial successes of the HapMap project, many questions remain unanswered. For example, will the HapMap results be applicable to populations other than those used to create the map? African subpopulations are known to differ more from one another than is usually observed among non-African populations. Likewise, founder populations of Caucasian origin may well present differences in haplotype structure (i.e., block length) and diversity. It has also been shown that allelic frequency and SNP density can alter, sometimes dramatically, the observed block structure (Zondervan and Cardon, 2004; Ke et al., 2004; Schulze et al., 2004). The HapMap Consortium elected to use SNPs with a minor allelic frequency of at least 5% in each population used to build the map. This threshold is consistent with those used in the study of common variants causing common diseases (Lander, 1996; Lohmueller et al., 2003; see also Article 11, Mapping complex disease phenotypes, Volume 3). Low-frequency variants could still, in principle, be identified using the map, although usually with much lower power (Zondervan and Cardon, 2004).

It is feasible to produce a haplotype map of a particular locus and/or in a different population. This requires genotyping a high-density set of SNPs on a subset of the samples and building a haplotype map that can then be used to select tag SNPs for subsequent studies in larger cohorts from the same population. The first step in building a haplotype map is to define the number of samples needed. The answer depends on haplotype diversity, but in general, panels of 90 unrelated individuals or 30 trios should be sufficient to detect haplotypes with greater than 5% frequency. Using unrelated samples instead of families maximizes the number of independent chromosomes, and many analysis programs can accurately determine haplotype phase from unrelated samples, including PHASE (Stephens et al., 2001) and the EM method (Excoffier and Slatkin, 1995). However, the use of families can improve the phasing of haplotypes when LD is lower and also provides a measure of genotyping errors (using tests for Mendelian inheritance).
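The panel sizes quoted above can be checked with a simple binomial calculation. The sketch below is an illustration under stated assumptions, not a calculation taken from the cited studies: it treats chromosomes as independent draws, counts two chromosomes per unrelated individual and four independent parental chromosomes per trio, and ignores phasing and genotyping error. The function name `prob_observed` is hypothetical.

```python
from math import comb


def prob_observed(freq, n_chromosomes, min_copies=1):
    """Probability of sampling at least `min_copies` copies of a haplotype
    with population frequency `freq` among independent chromosomes."""
    p_fewer = sum(
        comb(n_chromosomes, k) * freq**k * (1 - freq) ** (n_chromosomes - k)
        for k in range(min_copies)
    )
    return 1 - p_fewer


unrelated = 90 * 2   # 90 unrelated individuals -> 180 chromosomes
trios = 30 * 4       # 30 trios -> 120 independent parental chromosomes

for label, n in [("90 unrelated", unrelated), ("30 trios", trios)]:
    print(label,
          f"P(seen at least once)={prob_observed(0.05, n):.4f}",
          f"P(seen at least 3 times)={prob_observed(0.05, n, 3):.4f}")
```

Requiring several copies rather than one (the `min_copies` argument) is a stricter and arguably more realistic criterion, since a haplotype seen only once is hard to distinguish from a phasing or genotyping error.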

The second step is to select the required density of SNPs. Ke et al. (2004) observed that a density of at least 1 SNP every 2 kb was needed to stabilize the block boundaries. However, since LD is not homogeneous throughout the genome and because haplotype blocks vary greatly in length, a hierarchical approach was adopted by the International HapMap Consortium (2003). In Phase I of the HapMap project, 1 SNP every 5 kb was genotyped in every population sample to identify large blocks. This 5 kb scan is expected to generate a map with approximately 60% of the genome in blocks. Additional markers are added in regions where LD is too low (i.e., where it is not possible to accurately predict the genotype of a nearby SNP) to cover an additional 20-30% of the genome. This strategy should be valuable for any kind of haplotype map. As of March 2004, there were more than 7 million SNPs in dbSNP, with a large fraction discovered by several resequencing projects of the International HapMap Consortium (2003). More than one million SNPs will be validated by the HapMap project, providing a valuable resource to the human genetics community. It was also shown that using double-hit SNPs (i.e., SNPs independently observed two or more times in SNP discovery projects) improves the overall success rate and correlates with a higher average minor allelic frequency.

The third step is to choose the appropriate genotyping platform. Many different platforms exist on the market, with wide variation in throughput and price. Highly parallel processing, from a few SNPs to several thousand, has reduced the price per genotype. However, depending on the number of SNPs genotyped, some platforms can be more advantageous than others (see Article 77, Genotyping technology: the present and the future, Volume 4).

The last step is to select a set of tag SNPs that accurately describes each block in the selected samples and that will be used subsequently to genotype the remaining samples (see below).

4. Selecting haplotype tag SNPs

One of the most important contributions of a haplotype map will be the description of a set of haplotype tag SNPs, specific to a population, that will significantly improve the efficiency of subsequent association studies. One straightforward method to pick htSNPs is presented in Figure 2. However, there can be as many htSNP sets as there are haplotype block definitions. Also, even though a blocklike description of the genome sounds appealing, relying on blocks to pick htSNPs raises some problems. First, there will always be significant LD between adjacent blocks, so a strategy that tries to tag each block, especially very small ones, could be inefficient. Second, it will be impossible to account for regions that lie outside a block. Cardon and Abecasis (2003) showed that the best htSNP selection should not depend on the concept of blocks but on more general patterns of LD and haplotype diversity. This means that, in a selected region, the chosen SNPs must capture through LD the variation that exists at the unexamined sites. Another important variable to consider is the balance between the reduction in the number of SNPs to genotype and the reduction in the power to detect an association.

Zhang et al. (2002b) noted that power is reduced by about 4% when 25% of the total number of SNPs, selected by their method, are used, compared to a drop of 12% when a random SNP set of similar size is chosen instead. When 14% of the SNPs are used, the power drops by 9% and 21%, respectively. Also, power loss was much greater using single-locus association tests than with a two-locus haplotype approach. Although other published methods seem to perform similarly, the different algorithms do not identify the same number of blocks or htSNPs. For example, Zhang et al. (2002a), using the same data as Patil et al. (2001), were able to reduce the number of haplotype blocks and htSNPs selected by more than 20%. Various other methods have recently been published that attempt to minimize the number of tag SNPs (Ke and Cardon, 2003; Meng et al., 2003; Carlson et al., 2004; Schulze et al., 2004; Sebastiani et al., 2003; Stram et al., 2003), suggesting that the optimal way to define the problem mathematically has probably not yet been described or will vary with local LD in different regions of the genome. Since r2, a measure of LD, is inversely proportional to the increase in the sample size required to detect an indirect association, the most useful approach, given the Haplotype Map, would be to pick the minimal set of htSNPs that describes all other markers at a given r2 threshold (e.g., 0.8). This allows the sample size needed to detect the association to be easily calculated (Carlson et al., 2003).
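A threshold-based selection of this kind can be sketched as a greedy covering procedure. The Python below is a minimal illustration, assuming a precomputed symmetric matrix of pairwise r2 values; it is not presented as any of the published algorithms cited above, and the names `greedy_tag_snps` and `required_sample_size`, the tie-breaking rule, and the toy matrix are all assumptions. The second function applies the inverse-proportionality relation from the text, taking the sample size needed when the causal variant is typed directly and dividing it by r2.

```python
import math


def greedy_tag_snps(r2, threshold=0.8):
    """Greedily pick tag SNPs so that every SNP is either a tag or has
    r^2 >= threshold with at least one tag.

    r2 is a symmetric square matrix (list of lists) of pairwise r^2 values.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        # Choose the SNP covering the most still-untagged SNPs
        # (ties broken by lowest index).
        best = max(
            sorted(untagged),
            key=lambda i: sum(1 for j in untagged if j == i or r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if j == best or r2[best][j] >= threshold}
    return tags


def required_sample_size(n_direct, r2_value):
    """Sample size for indirect association, given the size needed if the
    causal variant itself were typed (n_direct) and the r^2 between the
    typed marker and the causal variant."""
    return math.ceil(n_direct / r2_value)


# Toy matrix (hypothetical): SNPs 0-2 are in strong LD with one another,
# as are SNPs 3-4, with weak LD between the two groups.
r2_matrix = [
    [1.00, 0.90, 0.90, 0.10, 0.10],
    [0.90, 1.00, 0.90, 0.10, 0.10],
    [0.90, 0.90, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.85],
    [0.10, 0.10, 0.10, 0.85, 1.00],
]
print(greedy_tag_snps(r2_matrix))
print(required_sample_size(1000, 0.5))
```

A greedy search is not guaranteed to find the smallest possible tag set, but it makes the trade-off explicit: lowering the threshold reduces the number of tags while increasing the sample size needed for SNPs that are only tagged indirectly.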

In the past decade, significant progress was made in defining the genetic basis of common diseases as well as in genotyping technologies and statistical methodologies. However, the mapping of common disease genes is still an arduous task today. The understanding of the haplotype structure of human chromosomes is crucial to improve the success of association studies. With results from the HapMap project available to all researchers, many studies that previously would have been done with a few nonvalidated SNPs can now be attempted with sufficient power to effectively screen hundreds of candidate genes and large chromosomal regions. It is important to realize that the haplotype mapping approaches are still in their infancy and need to be improved for the full potential of the Human Haplotype Map to be achieved: (1) genotyping costs need to decrease; (2) analysis methods must improve; and (3) pilot studies must evaluate the effect of using different populations or the impact of disease polymorphisms, with various penetrance and allelic frequency, on the success of association mapping of common traits. Despite the caveats, the Haplotype Map will be a powerful tool to systematically screen the genome and elucidate the genetic causes of common genetic diseases.
