Linkage mapping (Genomics)

1. Introduction and scope

The purpose of this chapter is to provide a practical guide to linkage mapping for the identification of genes predisposing to human disease (or other interesting phenotypes). The emphasis will be on technical issues and pedigree-based analysis. More theoretical concerns, particularly those relating to methods in statistical genetics, will be covered in depth elsewhere (see Article 48, Parametric versus nonparametric and two-point versus multipoint: controversies in gene mapping, Volume 1, Article 52, Algorithmic improvements in gene mapping, Volume 1, Article 58, Concept of complex trait genetics, Volume 2, and Article 11, Mapping complex disease phenotypes, Volume 3). Alternative approaches such as linkage disequilibrium (LD) and SNP-based association mapping are covered in other chapters (see Article 12, Haplotype mapping, Volume 3, Article 17, Linkage disequilibrium and whole-genome association studies, Volume 3, Article 69, Reliability and utility of single nucleotide polymorphisms for genetic association studies, Volume 4, Article 73, Creating LD maps of the genome, Volume 4, and Article 74, Finding and using haplotype blocks in candidate gene association studies, Volume 4).

2. General approaches to linkage mapping

Linkage as a formal term refers to the mapping of a predisposing polymorphism or mutation at a genetic locus through the analysis of chromosomal segments transmitted to individuals with some known degree of relationship (Ott, 1991; Terwilliger and Ott, 1994) (see Figure 1). The enabling principle for linkage mapping in humans is the use of anonymous polymorphic DNA variants, called markers, as tags for these chromosomal segments (Botstein et al., 1980). Using these tags, one can detect the correlated inheritance of a particular trait with that of closely linked marker loci.


The genetic mapping described here results preferentially from the analysis of pedigrees showing Mendelian segregation of the trait of interest. By this we mean: phenotypes are relatively straightforward to characterize; transmission in families is generally unilineal and unambiguous (although this can be confounded in some populations that exhibit high degrees of consanguinity); and underlying sequence variants usually confer obvious and severely deleterious effects on gene function (or in rarer cases obvious and severe gain of gene function). For the detection of linkage, large families have more statistical power than small ones, but these are not always available, especially for traits with low penetrance, delayed age of onset, or for traits of complex etiology (Haines and Pericak-Vance, 1998; see also Article 58, Concept of complex trait genetics, Volume 2 and Article 11, Mapping complex disease phenotypes, Volume 3).

Figure 1 Schematic visualization of a pedigree segregating a biological phenotype of interest. According to convention, square symbols are male, circles are female. Affected individuals are shaded in black. The four hypothetical founding haplotypes for a given chromosome are indicated in red, blue, green, and yellow. Additional copies of this chromosome, introduced by spouses marrying into the pedigree, are displayed as unshaded bars. In this example, it is presumed that the affected founder is known for this pedigree (male in top generation). A causal mutation at a specific locus is presupposed by the X on the red haplotype. Recombination events reduce the extent of the red haplotype transmitted through the pedigree. In this idealized example, there is perfect cosegregation of the mutation (X) and a surrounding segment of red haplotype, in all affected individuals

3. Properties of genetic markers

Over the years, a variety of different genetic markers have been used for mapping purposes. For the past decade, the markers of choice for linkage have been microsatellite repeats, also known as VNTRs (variable number of tandem repeats) or STRs (short tandem repeats) (see Figure 2) (Litt and Luty, 1989; Taylor et al., 1989; Beckmann and Soller, 1990; Weber, 1990). They consist of stretches of repeating units such as CACACA or GATAGATAGATA, embedded within unique sequences in various chromosomal locations. For the most part, they lie outside the coding exons of genes, since varying repeat lengths other than triplets would otherwise lead to frameshift mutations. A particular repeat marker may be unambiguously detected using appropriately designed PCR primers in the surrounding unique sequence. These repeats frequently vary in length in different individuals, presumably because of occasional mistakes made by the replication machinery. Such events are relatively infrequent, so that these repeats are stable enough to be used in studies spanning multiple generations in a pedigree. They are not wholly stable however, and the identity of microsatellite alleles by state is usually not sufficient to infer identity by descent in individuals of distant or unknown genealogical relationship (with exceptions in founder populations). In order to be useful in linkage analysis, a marker must have multiple alleles present in a population. The more alleles the better; however, owing to the limitations of analytical approaches, it is usually best to use markers with no more than 12-15 alleles. The information content of markers is commonly measured using the heterozygosity value and the polymorphism information content (PIC) value. Dinucleotide microsatellites are typically more informative than tri- or tetranucleotide repeats, with heterozygosities as high as 0.7-0.8.
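The two informativeness measures named above can be computed directly from population allele frequencies. The sketch below assumes the usual definitions: heterozygosity H = 1 - sum(p_i^2), and the PIC of Botstein et al. (1980), PIC = 1 - sum(p_i^2) - sum over i < j of 2 p_i^2 p_j^2. The function names and the example frequencies are illustrative only.

```python
from itertools import combinations


def _check(freqs):
    # Allele frequencies for one marker should sum to 1.
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")


def heterozygosity(freqs):
    """Expected heterozygosity: probability that two random alleles differ."""
    _check(freqs)
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content (Botstein et al., 1980 definition)."""
    _check(freqs)
    homozygosity = sum(p * p for p in freqs)
    # Correction term for matings in which offspring are uninformative.
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross


if __name__ == "__main__":
    # Hypothetical marker with four equally frequent alleles.
    example = [0.25, 0.25, 0.25, 0.25]
    print(heterozygosity(example), pic(example))
```

PIC is never larger than heterozygosity for the same marker, which is why the two values are often reported together.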

Figure 2 Typical dinucleotide microsatellite repeat marker. A (CA)n repeat is embedded within a unique sequence context, which provides for the development of marker-specific PCR amplification primers. Below is a chromatogram for a real dinucleotide repeat, D21S1914, located on chromosome 21. Genotypes are shown for four individuals: the first and fourth are homozygotes, the second and third are heterozygotes. Relative allele sizes (i.e., names) are indicated in small black boxes below the highest molecular weight peak for each allele.

It is estimated that the human genome has 5000-10 000 such microsatellite repeats. For these markers to be used in a genetic mapping study, their relative order and distance along each chromosome must be known. Several genetic maps have been generated, through marker genotyping in large families with multiple meioses, and using recombination data to orient and locate markers with respect to each other (Weissenbach et al., 1992; Gyapay et al., 1994; Broman et al., 1998; Yu et al., 2001; DeWan et al., 2002; Kong et al., 2002). Now almost all markers can be placed unambiguously on the assembled human genome sequence.

The human nuclear genome consists of approximately 3600 centimorgans (cM) in genetic distance (averaged over both sexes) (Kong et al., 2002). Thus, to cover the genome at 10-cM resolution requires 360 genetic markers, assuming each is fully informative; 5-cM resolution requires 720 markers. Good microsatellites of high information content come close to these limits, hence genome-wide mapping panels have on the order of 400 (10 cM) to 800 (5 cM) markers respectively. Such sets are commercially available (Reed et al., 1994; Lindqvist et al., 1996).

The general approach to a linkage mapping experiment is to perform a whole-genome scan at approximately 5- or 10-cM density on a set of samples from one or more families transmitting the phenotype of interest. Following statistical analysis, potential regions of linkage are identified on various chromosomes. Each of these is subjected to genotyping of additional microsatellite markers at increased density, followed by reanalysis. Ideally, only a single region survives the second round of mapping. A third round of genotyping with an increased density of markers, potentially exhausting all microsatellite repeats in a region, may follow. Owing to the relatively high cost of complete genome scans, often only subsets of sampled individuals are genotyped in the initial phase, focusing on those carrying the most definitive phenotypic state.

In some situations, linkage analysis may begin with or be restricted to specific genes with a higher biological probability of involvement in the phenotype, often termed candidate genes. The general principles of mapping are the same, but in practice this reduces the scope of genotyping to smaller sets of markers near these genes, with potentially significant cost savings. Often, this approach is used in newly ascertained families to exclude genes in which mutations are already known to cause the phenotype of interest.

For fine-mapping of recombination events in specific meioses, commercially available genome scan panels have insufficient resolution. Therefore, laboratories involved in linkage mapping must develop additional custom markers. Public databases include many thousands of potentially available microsatellite markers that can be used for such fine-mapping. These are typically identified with a “D” number such as D1S2134, indicating the chromosome (here, chromosome 1) plus the specific marker number. However, some nonpolymorphic PCR amplification products or sequence tagged sites (STSs) were historically assigned D numbers; thus, such numbers are not automatically indicative of microsatellite repeat status. Moreover, some useful markers have never been assigned D numbers but retain only the numbers from the projects in which they were developed, such as Utah markers (UT numbers) (Utah Marker Development Group, 1995), Marshfield markers (Mfld numbers) (Broman et al., 1998; Weber, 1990; Weber and Broman, 2001; DeWan et al., 2002), and Genethon markers (AFM numbers) (Weissenbach et al., 1992; Gyapay et al., 1994; Reed et al., 1994). Current databases attempt to unify all known markers and aliases for each marker, but two differently named markers are not necessarily distinct.

Once all identified markers in a region have been exhausted, the genome sequence in chromosomal regions of interest can be directly examined for additional microsatellite motifs using standard bioinformatics tools. The total potential resolution of microsatellite markers is on the order of 0.2 to 1 cM, which is typically sufficient for positional cloning purposes. Recall that the smallest definable interval containing a putative causal variant is a function of recombination events that have occurred in meioses between the patients who have been sampled. The actual size of the recombinant interval is not dependent on the density of markers being used to analyze it. Only the resolution with which the exact site of recombination is mapped is affected by the density of markers employed.
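As an illustration of examining sequence directly for microsatellite motifs, the following is a minimal regular-expression sketch that reports perfect di-, tri-, and tetranucleotide repeats. The function name, the threshold of eight repeat units, and the test sequence are assumptions made for the example; dedicated repeat-finding tools also handle imperfect repeats and are better suited to real genomic sequence.

```python
import re


def _is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter unit (e.g. CACA)."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n)
    )


def find_microsatellites(sequence, min_repeats=8, unit_sizes=(2, 3, 4)):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    Positions are 0-based. Only perfect, uninterrupted repeats are found.
    """
    seq = sequence.upper()
    hits = []
    for k in unit_sizes:
        # A k-base unit followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if not _is_primitive(unit):
                continue  # skip homopolymers and shorter-unit repeats
            hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sequence containing a (CA)12 and a (GATA)9 repeat.
    test_seq = "TTGACC" + "CA" * 12 + "GGTC" + "GATA" * 9 + "CCT"
    for start, unit, count in find_microsatellites(test_seq):
        print(start, unit, count)
```

Candidate repeats found this way still need flanking unique sequence for primer design and must be tested for polymorphism in the families under study.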

Recently, commercial mapping panels of single-nucleotide polymorphisms (SNPs) have begun to come into use for linkage mapping (Tsai et al., 2003; Sellick et al., 2004). A single microsatellite marker is often considered to have equivalent information content to 3-4 SNPs. SNP technologies will not be reviewed here (see Article 50, Gene mapping and the transition from STRPs to SNPs, Volume 1); however, whole-genome mapping sets are now commercially marketed. These sets initially contained in the range of 4000-10 000 SNPs, roughly equivalent to a 5-cM microsatellite genome scan, and are designed for family-based linkage analysis. More recently, SNP panels of 100 000 markers have been developed. As SNPs are generally biallelic, it is intrinsically more straightforward to generate allele calls, so manual review may be unnecessary at least for standardized marker sets. SNPs are believed to be stable over long periods of time. A given SNP is usually presumed to have arisen once only during evolution (although some nucleotide positions may turn out to be unstable and mutate repeatedly). Thus, identity of state for an SNP site in two individuals is considered to be indicative of identity by descent. For fine-mapping equivalent to microsatellites at about 1-cM resolution, hundreds of thousands to millions of potential SNPs are available in public databases; however, the informativeness of these must be evaluated for specific patient samples in a family-mapping study (Sachidanandam et al., 2001; Holden, 2002). Even if SNP sets become the standard tool for low-resolution genome scanning, microsatellites will probably continue to play a useful role in fine-mapping for this reason.

4. Microsatellite genotyping

Microsatellites are typically assayed following PCR. One of the PCR primers is usually tagged with a fluorescent dye, and the products of PCR are resolved electrophoretically either on polyacrylamide gels or by capillary electrophoresis (Ziegle et al., 1992; Gelfi et al., 1994; Reed et al., 1994; Gyapay et al., 1996; Lindqvist et al., 1996; Mansfield et al., 1996; Ghosh et al., 1997; Mansfield et al., 1997; Vainer et al., 1997; Wenz et al., 1998; Delmotte et al., 2001; Wenz et al., 2001). Although the concept is straightforward, there are potential technical pitfalls.

PCR primers must amplify the marker in question with high specificity. Ideally, both primers should lie in unique sequence; however, in practice, this is sometimes difficult to achieve as microsatellites often lie in or near repetitive elements. Hence, PCR conditions may require optimization to generate sufficiently specific products. For genome scan mapping panels, standardized conditions have been developed and are available, although laboratories should be prepared to reoptimize if needed. In developing custom microsatellite markers for fine-mapping, laboratories must usually develop their own PCR conditions. One may move PCR primers in addition to altering reaction conditions, as long as the primers remain specific to the repeat unit under development. Indeed, the exact primer sequences of commercial marker kits may be proprietary and different from public database primers for those markers.

When DNA is extracted from patient blood samples, both maternal and paternal chromosomes are recovered, hence both alleles of a microsatellite marker are observed. On occasion, genotyping may employ DNA from single sperm cells, or from cell lines reduced to haploidy through cell fusion and chromosome loss. But for linkage analyses, blood samples are the usual source. Inactivation of X chromosomes in females presents no problem.

A typical dinucleotide chromatogram is shown in Figure 2, with examples of homozygous and heterozygous individuals. Although unique products have been amplified, note that there are multiple peaks even in the homozygotes. Extra so-called stutter peaks are observed, smaller than the full-length product, and presumed to result from enzymatic skipping during PCR. The spacing of stutter peaks is equivalent to the type of repeat unit. Stutter peaks do not generally impede genotyping. The size of a microsatellite allele is usually defined by the position of the largest molecular weight peak.

The enzymes typically used in PCR have a tendency to add additional, non-templated nucleotides at the 3′ ends of products, to a variable extent. Thus, microsatellite chromatograms historically have suffered from the problem of “peak-splitting”. This is separate from and in addition to the observation of stutter peaks. If this problem is severe enough, particular markers may be wholly useless. Specific added sequence elements can reduce the intrinsic variability of nontemplated addition. These sequence elements are added to the 5′ end of the nonlabeled PCR primer, so that variability in nontemplated addition is reduced at the 3′ end of the labeled strand, which is the strand visualized by the instrumentation (see Figure 2) (Brownstein et al., 1996; Magnuson et al., 1996).

Despite optimization, some markers do routinely give extra peaks, presumably because of additional priming sites in the genome. Such markers may still be useful if these peaks are sufficiently reproducible. However, automated genotyping programs may require additional training to deal with them. In some cases, extra peaks fall into the expected allele range for other markers multiplexed with the marker in question.

Microsatellite genotype calls, usually given in base pairs, are not exact but are relative to internal size standards, and as such are only indirect readouts of the actual number of repeats in a given allele in a given sample. These size standards may be purchased from commercial suppliers or synthesized in the laboratory (Brondani and Grattapaglia, 2001). Unfortunately, the interpolation of allele sizes is dependent on the specifics of the electrophoretic system used. Thus, genotypes are difficult to compare between different instrument platforms, and often between different laboratories’ versions of the same marker. One solution to this is to normalize all allele calls for a given marker to a standard DNA sample, such as a CEPH control DNA. This technique allows data to be pooled across multiple platforms, although standardized calls may need to be created independently for each different instrument.
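The normalization to a control DNA described above can be sketched as follows. This is a simplified illustration that assumes a constant size offset per marker and platform, estimated from the control sample's two alleles; in practice the shift may vary with fragment size. All function names and allele sizes are hypothetical.

```python
def platform_offset(control_observed, control_reference):
    """Mean shift (bp) that maps this platform's control calls to reference sizes.

    Both arguments hold the control sample's alleles for one marker; they are
    paired after sorting by size.
    """
    pairs = list(zip(sorted(control_observed), sorted(control_reference)))
    return sum(ref - obs for obs, ref in pairs) / len(pairs)


def normalize_genotype(genotype, offset):
    """Apply the platform offset and round to integer allele names."""
    return tuple(sorted(round(allele + offset) for allele in genotype))


if __name__ == "__main__":
    # Hypothetical marker: control DNA is 150/156 by agreed reference naming,
    # but sizes as 151.8/157.9 on this instrument.
    offset = platform_offset((151.8, 157.9), (150, 156))
    print(normalize_genotype((153.9, 159.8), offset))
```

An offset computed this way applies only to the marker, dye, and instrument on which the control was run, which is why standardized calls may need to be created separately for each platform.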

To increase efficiency, multiplexing is typically performed. Unfortunately, microsatellites have proved recalcitrant to pre-PCR multiplexing. Therefore, multiplexing of microsatellites is usually performed after PCR and prior to electrophoresis.

Since multiplexed markers are subjected to electrophoresis in the same lane or capillary, it is critical to associate specific chromatogram peaks with the correct marker. Allelic size ranges are determined for a specific PCR primer pair used to amplify a marker, either based on public information or else by test genotyping a set of random control DNAs. Thus, peaks for a specific marker have an expected size range where genotypes are called. However, novel alleles are often observed when large numbers of experimental samples are subsequently genotyped for that marker. Some of these alleles may fall outside the expected range for that marker. In this case, trained software must be updated to incorporate the new alleles, which can be problematic if there is overlap with other markers that were previously multiplexed in the same lane. This problem is often identified through the failure of a marker neighboring the actual marker with the allelic expansion. In the worst case, individual markers may have to be removed from a multiplexed panel and electrophoretically analyzed separately. To minimize potential for subsequent allelic overlap, when a new panel of multiplexed markers is developed, a gap should be provided between the known ranges of size-adjacent markers.

Multiplexing also relies on the availability of multiple fluorescent dye tags with different emission spectra. Markers with overlapping size range but different dye tags may thus be pooled. Commercial systems typically permit four different dyes to be multiplexed, one of which is used for the internal size standard. The various fluorescent dyes alter the mobility of DNA fragments, so that the apparent electrophoretic mobility of a given marker will change if a different dye is substituted. A similar and even more severe problem may arise if the dye tag or spacer structure is altered on the internal size standard, in which case ALL marker mobilities may have to be redefined.
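When designing a multiplexed panel, the size-range and dye considerations above can be checked programmatically. The sketch below flags pairs of markers that carry the same dye and whose known allele ranges are separated by less than a chosen gap. The marker names, ranges, and the 10-bp default gap are hypothetical values chosen for illustration.

```python
from itertools import combinations


def find_conflicts(panel, min_gap_bp=10):
    """Return pairs of same-dye markers whose size ranges are too close.

    Each panel entry is a dict with keys: name, dye, min_bp, max_bp.
    Markers labeled with different dyes are never in conflict.
    """
    conflicts = []
    for a, b in combinations(panel, 2):
        if a["dye"] != b["dye"]:
            continue
        lower, upper = sorted((a, b), key=lambda m: m["min_bp"])
        # Gap between the top of the smaller marker and the bottom of the larger.
        if upper["min_bp"] - lower["max_bp"] < min_gap_bp:
            conflicts.append((lower["name"], upper["name"]))
    return conflicts


if __name__ == "__main__":
    panel = [
        {"name": "M1", "dye": "FAM", "min_bp": 100, "max_bp": 130},
        {"name": "M2", "dye": "FAM", "min_bp": 135, "max_bp": 160},
        {"name": "M3", "dye": "HEX", "min_bp": 110, "max_bp": 140},
        {"name": "M4", "dye": "FAM", "min_bp": 180, "max_bp": 210},
    ]
    print(find_conflicts(panel))
```

Rerunning such a check whenever a novel allele extends a marker's observed range gives early warning of overlap with a neighboring marker.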

Commercial mapping panels have been optimized for marker dye color and spacing so that as many as 15-20 different microsatellites may be assayed in the same lane, significantly reducing cost and enhancing throughput. With effort, custom panels can also be highly multiplexed, although this may not necessarily be cost-effective.

Following electrophoresis and data collection, actual genotype calls must be made. This can be performed either fully manually or semiautomatically. Commercially available software packages exist for automated genotype calling (Applied Biosystems GeneMapper, SoftGenetics GeneMarker), but while effective, these packages require caution in actual use. In practice, some amount of manual review is always necessary, particularly for markers with complex chromatograms or extra peaks. Nonetheless, current software genotyping programs can be very efficient in reducing the required amount of manual trace review for well-behaved markers or markers with which laboratories have extensive experience.

It is recommended that if primers are redesigned, version numbers be used explicitly. It can be highly confusing if multiple versions of a marker, with slightly different primer sequences and/or dye types, have the same name in a laboratory system. Version numbers may need to be removed prior to statistical analysis, since the internal identifiers will otherwise be inconsistent with public database and map names.

For the research laboratory not equipped for whole-genome microsatellite mapping, there are several outsourcing alternatives. However, fine-mapping of potential linkages with custom markers is almost always the next step. Outsourcing such custom genotyping is more problematic, and laboratories with serious interest in linkage mapping are encouraged to develop at least some capabilities for internal genotyping.

For microsatellite PCR, 5-20 ng of high-quality genomic DNA are required for each marker. Thus, whole-genome scans at 5-cM resolution with follow-up demand 5-20 µg of DNA per patient. These quantities can routinely be achieved using fresh blood samples in the tens of milliliters, or equivalent frozen white cells (buffy coats), or from cell culture of immortalized fibroblast lines. In cases in which a blood draw is not possible, buccal (cheek) swabs may sometimes be obtained, yielding sufficient DNA for small numbers of reactions only. Recently, several protocols have been developed for whole-genome amplification, particularly suited for whole-genome SNP analysis since so many more markers are required. The utility of these protocols for microsatellite genotyping is not fully validated.
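The DNA budget quoted above follows from simple arithmetic, shown here as a small sketch. The figure of 800 scan markers comes from the 5-cM panels described earlier; the allowance of 200 follow-up reactions is an assumption chosen for illustration, and no allowance is made for repeated or failed reactions.

```python
def dna_required_ug(n_reactions, ng_per_reaction):
    """Total genomic DNA needed, in micrograms, for single-marker PCRs."""
    return n_reactions * ng_per_reaction / 1000.0


if __name__ == "__main__":
    scan_markers = 800       # approximate size of a 5-cM genome scan panel
    follow_up_markers = 200  # hypothetical fine-mapping allowance
    total = scan_markers + follow_up_markers
    for ng in (5, 20):
        print(ng, "ng per reaction:", dna_required_ug(total, ng), "ug")
```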

High-volume genotype data must be appropriately archived and made available to statistical geneticists. Integrating clinical, pedigree, and genotype data sets can be surprisingly challenging. Moreover, statistical analysis programs generally require very specific formatting of data. Unfortunately, there are few appropriate commercial database prototypes serving the needs of human geneticists, although ProgenyLab is a relatively recent entry in this area. Laboratories expecting to perform large amounts of linkage mapping are strongly encouraged to develop integrated database systems.

5. Statistical genetic analysis of linkage

The essence of linkage analysis is to detect the cosegregation of a particular chromosomal segment (defined through marker genotyping) with the phenotypic state of interest, in a set of related patients such as a single family (Ott, 1991; Terwilliger and Ott, 1994). The question is whether any particular chromosome segment in the genome cosegregates with the phenotype more frequently than one would expect by chance alone. To determine this probability usually requires elaborate mathematical analysis. The required statistical methodologies are discussed in detail elsewhere (see Article 50, Gene mapping and the transition from STRPs to SNPs, Volume 1, Article 52, Algorithmic improvements in gene mapping, Volume 1, and Article 17, Linkage disequilibrium and whole-genome association studies, Volume 3). Here, we give only the briefest overview of statistical genetics, to place the discussion of genotyping into a procedural context.

Statistical genetic tests are traditionally subdivided into two main categories: those in which explicit modeling assumptions are made concerning the behavior of a presumptive causal allele, and those in which no such assumptions are made. These are termed parametric (or model-based) and nonparametric (or model-free) analysis respectively. The terms “model-based” and “model-free” are preferred, however, as most methods labeled nonparametric do nonetheless rely on some genetic assumptions. In model-based analysis, it is assumed that the population frequency of the disease allele and its penetrance in homozygotes and heterozygotes can be accurately estimated. When the mode of action of the disease gene cannot be predicted with confidence, as is the case for complex diseases, model-free analyses are typically used. Generally, these simply test for excess sharing or preferential transmission of particular marker alleles in family units.

The most commonly used statistic for model-based linkage analysis is the maximum likelihood ratio. This tests the hypothesis of disease and marker cosegregation versus the null hypothesis of random segregation. For historical reasons of convenience, the base 10 logarithm of the ratio of the likelihoods is used and referred to as the LOD score (log of odds) (Morton, 1955). The conventional significance threshold used in linkage analysis is LOD > 3 for Mendelian diseases. The genome-wide significance threshold is generally set slightly higher to LOD = 3.3 for complex trait analysis. It is also possible to determine the significance of a test applied to a particular data set empirically using computer simulations. To this end, replicates of the family collection are generated by computer, with random genotypes based on correct inheritance, allele frequencies, and marker recombination fractions. The linkage testing procedure is conducted in each simulated dataset and the maximum LOD score or p-value is noted. The genome-wide threshold of significance is taken as a score that is exceeded in fewer than 5% of replicates.
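To make the LOD score concrete, the sketch below computes it for the simplest textbook case: n fully informative, phase-known meioses of which k are recombinant between disease and marker loci. Under these assumptions the likelihood at recombination fraction theta is theta^k (1 - theta)^(n - k), and the null hypothesis of no linkage corresponds to theta = 0.5. Real pedigree likelihoods (unknown phase, incomplete penetrance, missing genotypes) require the dedicated programs cited in this chapter; the function names here are illustrative.

```python
import math


def lod_score(n_meioses, n_recombinants, theta):
    """Two-point LOD score for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    n, k = n_meioses, n_recombinants
    if theta == 0.0:
        # Any observed recombinant is impossible under complete linkage.
        return float("-inf") if k > 0 else n * math.log10(2.0)
    log_l_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_linked - log_l_null


def max_lod(n_meioses, n_recombinants):
    """Maximum-likelihood theta (capped at 0.5) and the LOD score there."""
    theta_hat = min(n_recombinants / n_meioses, 0.5)
    return theta_hat, lod_score(n_meioses, n_recombinants, theta_hat)


if __name__ == "__main__":
    for k in (0, 1, 2):
        theta_hat, z = max_lod(10, k)
        print(k, "recombinants in 10 meioses:", theta_hat, round(z, 2))
```

With no recombinants, each informative meiosis contributes log10(2), about 0.3, so ten such meioses are needed to pass the conventional threshold of 3. This is one way to see why large families are so valuable.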

Statistical linkage analysis can be performed using either a single genetic marker at a time (two-point linkage, that is, disease locus and marker locus), or alternatively using multiple genetic markers simultaneously (multipoint linkage). The advantage of using multiple markers is that the phase of markers can be estimated with more precision, and this can add considerable power to the test. Multipoint linkage calculations, however, require significant computational resources when sufficiently large pedigrees or numbers of markers are analyzed. Exact solutions of multipoint linkage are computationally infeasible for very large pedigrees or marker sets with the commonly employed tools such as LINKAGE, FASTLINK, ALLEGRO, GENEHUNTER, MERLIN, and VITESSE (Lathrop et al., 1984; Cottingham et al., 1993; Schaffer et al., 1994; O’Connell and Weeks, 1995; Kruglyak et al., 1996; Gudbjartsson et al., 2000; Markianos et al., 2001; Sobel et al., 2001; Abecasis et al., 2002). Researchers usually accept limits on the pedigree size or number of markers that can be examined simultaneously. There are programs such as SIMWALK, LOKI, and MCSIM that estimate inheritance vectors using approximation methods (Weeks et al., 1995; Heath, 1997; Thomas et al., 2000). Such programs have been shown to give accurate LOD scores in the majority of cases, and provide a valid alternative when exact computations are impossible (see Article 52, Algorithmic improvements in gene mapping, Volume 1).

Linkage results may be presented numerically in tabular format, but for multipoint analysis, it is common to report results graphically, with scores for parametric or nonparametric linkage plotted as a function of position on a chromosome. In this way, recombination events that unlink chromosomal segments from the phenotype appear as drops in the linkage statistic (Figure 3).

Although direct visualization of allelic phases, haplotypes, and recombination events is not theoretically necessary for locus mapping, in practice, it is widely used for manual review of linkage analysis results (Figure 4). The definition of the haplotypes in a pedigree is achieved by the phasing of alleles at each genotyped marker. By phase, we mean the two alleles of each marker must be assigned as having been transmitted from the paternal or maternal parent. For fully informative markers the process is simple, but for real markers in incompletely sampled pedigrees, phasing of alleles requires mathematical techniques. As with multipoint LOD score calculation, phase determination for large pedigrees and marker sets is computationally restrictive. An added difficulty in visualizing multimarker haplotypes is incorporating them into pedigree drawings. Pedigree drawing packages such as Cyrillic, Progeny, or PedDraw are all usable, though all have limitations in dealing with large haplotypes or complex pedigrees. Haplotype reconstruction analyses can also be performed using approximation algorithms as implemented in SIMWALK or MCSIM (which must be interpreted cautiously), or else smaller pedigrees or subsections of pedigrees can be haplotyped exactly using GENEHUNTER or MERLIN and manually assembled.
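For a fully informative marker, phasing within a single parent-child trio is simple enough to sketch directly. The function below returns the child's alleles ordered as (paternal, maternal), or None when the trio is uninformative or inconsistent. It is an illustration of the principle only, not the algorithm used by GENEHUNTER, MERLIN, or the other packages mentioned; the function name and allele sizes are hypothetical.

```python
def phase_child(child, father, mother):
    """Assign parental origin to a child's two alleles at one marker.

    Genotypes are 2-tuples of allele names. Returns (paternal, maternal),
    or None if the origin cannot be determined from this trio alone.
    """
    a, b = child
    options = set()
    if a in father and b in mother:
        options.add((a, b))
    if b in father and a in mother:
        options.add((b, a))
    # Exactly one assignment means the phase is resolved.
    if len(options) == 1:
        return options.pop()
    return None  # ambiguous (zero options means a Mendelian inconsistency)


if __name__ == "__main__":
    print(phase_child((150, 154), (150, 152), (154, 156)))  # informative
    print(phase_child((150, 154), (150, 154), (150, 154)))  # ambiguous
```

When a trio is ambiguous at one marker, flanking markers and other relatives may still resolve the phase, which is where the multipoint methods described above become necessary.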

Figure 3 Multipoint linkage. mpLOD score as calculated by the MCSIM algorithm is plotted versus chromosomal location for a dense set of fine-mapping markers near the FH3 hypercholesterolemia locus

Figure 4 Haplotype visualization. A typical pedigree is shown with marker alleles phased using Genehunter. Each individual is uniquely identified by generation (in Roman numerals) and place (in Arabic numerals). Filled symbols designate affected individuals, open symbols are unaffected individuals, question marks inside symbols indicate individuals whose diagnosis is either unknown or ambiguous. Individuals with DNA samples collected are indicated. Genetic markers (anonymized) are listed in chromosomal order on the left. Beneath each genotyped individual, allele sizes are given for each marker; question marks here indicate a failure to call an allele for that marker/individual combination. Alleles have been phased so that chromosomal haplotypes may be visualized directly, although in this example only one haplotype is explicitly identified by a black bar. Recombination events may be observed in individuals IV:079 and IV:001

As individual SNPs are insufficiently informative for most types of linkage analysis, replacement of microsatellites with SNP-based linkage mapping sets will demand multipoint linkage and haplotyping analyses. Current software packages such as Genehunter and Vitesse have not yet been extensively tested in such scenarios; however, it is anticipated that they will be applicable. New tools for SNP analysis include SNPLink, ALOHOMORA, and HaploPainter (Ruschendorf and Nurnberg, 2005; Thiele and Nurnberg, 2005; Webb et al., 2005).

All linkage algorithms require correct Mendelian inheritance patterns of the individual marker alleles. This can be tested prior to linkage analysis using the PedCheck program, which verifies the structural integrity of the pedigrees and the Mendelian inheritance of alleles irrespective of phenotypic status (O’Connell and Weeks, 1998). When errors are detected, they may sometimes be corrected by review of the raw genotype data. However, some inheritance errors cannot be explained by any obvious technical mistake. In such cases, it is advisable to eliminate the allele calls of all pedigree members in the nuclear families generating the error or, in more severe cases, to eliminate the marker completely from analysis. One source of such inheritance errors is spontaneous mutation of a microsatellite allele to a different repeat length. Much theoretical attention has been given to the topic of unidentified genotype errors in data sets.
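The kind of check described above can be illustrated with a minimal sketch. This is not PedCheck's algorithm, only a simple parent-child trio test under stated assumptions: genotypes are pairs of allele sizes, `None` marks a failed allele call, and the function names and data layout are hypothetical.

```python
from itertools import product


def trio_is_consistent(father, mother, child):
    """Return True if the child's genotype can be formed by taking one
    allele from each parent. Genotypes are 2-tuples of allele sizes."""
    if None in father or None in mother or None in child:
        # Assumption: any missing call is treated as uninformative.
        return True
    target = sorted(child)
    return any(sorted((f, m)) == target for f, m in product(father, mother))


def find_mendelian_errors(pedigree, genotypes):
    """pedigree: dict child_id -> (father_id, mother_id)
    genotypes: dict marker -> dict person_id -> (allele, allele)
    Returns a list of (marker, child_id) pairs failing the trio check."""
    errors = []
    for marker, calls in genotypes.items():
        for child, (father, mother) in pedigree.items():
            if not all(p in calls for p in (father, mother, child)):
                continue  # trio not fully genotyped at this marker
            if not trio_is_consistent(calls[father], calls[mother], calls[child]):
                errors.append((marker, child))
    return errors


pedigree = {"II:1": ("I:1", "I:2")}
genotypes = {
    "M1": {"I:1": (100, 104), "I:2": (102, 106), "II:1": (100, 106)},
    "M2": {"I:1": (200, 200), "I:2": (202, 204), "II:1": (206, 202)},
}
errors = find_mendelian_errors(pedigree, genotypes)
print(errors)
```

In this toy data set the child's allele 206 at marker M2 is carried by neither parent, so that marker/individual combination is flagged for review of the raw genotype data.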

In monogenic as well as complex disorders, mutations in different genes can result in a similar phenotype, so that groups of families displaying a shared phenotype may not segregate a causal variant in the same gene (genetic or locus heterogeneity). Care must therefore be taken when pooling pedigrees for linkage analysis. It may be possible, in some instances, to subgroup pedigrees according to subtle phenotypic differences. Alternatively, robust statistical analyses that allow for locus heterogeneity in the calculation of heterogeneity LOD scores can be used; such methods can improve the power of linkage detection when only a fraction of families are linked. In a different scenario, families may segregate different mutations in the same gene for a given phenotype (allelic heterogeneity). In this case, different families will generate linkage to the same chromosomal interval without sharing the same marker haplotype. In special populations, such as those of French Canada, Newfoundland, and Finland, identical chromosomal segments or haplotypes can often be detected in different family units whose genealogical relationship may not be known (de la Chapelle and Wright, 1998; Laan and Paabo, 1998; Arcos-Burgos and Muenke, 2002). Such populations are frequently referred to as founder populations or population isolates.
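As an illustration of a heterogeneity LOD score, the sketch below uses one common formulation, the admixture model, in which a proportion alpha of families is assumed to be linked. The chapter does not specify which formulation is intended, so this choice, the function names, and the per-family LOD values are all assumptions made for illustration.

```python
import math


def hlod(family_lods, alpha):
    """Admixture heterogeneity LOD at a single map position.
    family_lods: per-family LOD scores at that position.
    alpha: assumed proportion of linked families (0 to 1)."""
    return sum(math.log10(alpha * 10 ** z + (1.0 - alpha)) for z in family_lods)


def max_hlod(family_lods, steps=100):
    """Grid search over alpha; returns (best_hlod, best_alpha)."""
    return max((hlod(family_lods, i / steps), i / steps) for i in range(steps + 1))


# Hypothetical data: two families support linkage, two argue against it.
lods = [2.0, 2.0, -3.0, -3.0]
pooled = sum(lods)              # simple summed LOD assuming homogeneity
best, alpha = max_hlod(lods)    # heterogeneity LOD maximized over alpha
print(pooled, round(best, 2), alpha)
```

With these made-up values the summed LOD is negative, whereas the heterogeneity LOD remains clearly positive at an intermediate alpha, which is the situation in which pooling unlinked families would otherwise mask a true linkage signal.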

One special subset of linkage analysis is homozygosity mapping. The general principles of recessive trait mapping are the same as for dominant traits. Accurate statistical analysis requires estimates of mutation and marker allele frequencies, which are not easily obtained in advance. However, in special cases, one can assume that affected individuals have received two copies of the same mutant allele. Such cases are common in specific populations with known high rates of consanguinity caused by either geographical or cultural factors, including founder populations (Lander and Botstein, 1987; Sheffield et al., 1998). Homozygous haplotype mapping can be applied by manual inspection if all affected patients from a population are homozygous for the same marker alleles at several successive markers in a genome scan. Failure to detect perfectly shared marker homozygosity does not rule out the hypothesis, since recombination events may have separated marker alleles from the disease allele in some affected individuals in a data set. Moreover, even in relatively isolated populations, it is possible for multiple mutations in a gene to be segregating, leading to haplotype heterogeneity and failure of homozygosity mapping. Nonetheless, this can be an extremely powerful approach.
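The manual inspection described above can be sketched in code. This is a simplified illustration, not a published method: it assumes markers are supplied in chromosomal order, that every affected individual has an entry for every marker, and that a run of three or more markers is of interest. The function name, data layout, and threshold are hypothetical.

```python
def shared_homozygous_runs(markers, affected_genotypes, min_run=3):
    """markers: marker names in chromosomal order.
    affected_genotypes: dict person_id -> dict marker -> (allele, allele).
    Returns lists of consecutive markers at which all affected individuals
    are homozygous for the same allele, for runs of at least min_run."""

    def shared_allele(marker):
        alleles = set()
        for calls in affected_genotypes.values():
            a, b = calls[marker]
            if a is None or a != b:
                return None  # missing call or heterozygous genotype
            alleles.add(a)
        return alleles.pop() if len(alleles) == 1 else None

    runs, current = [], []
    for marker in markers:
        if shared_allele(marker) is not None:
            current.append(marker)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = []
    if len(current) >= min_run:
        runs.append(current)
    return runs


markers = ["M1", "M2", "M3", "M4", "M5"]
affected = {
    "P1": {"M1": (1, 2), "M2": (3, 3), "M3": (5, 5), "M4": (7, 7), "M5": (2, 2)},
    "P2": {"M1": (1, 1), "M2": (3, 3), "M3": (5, 5), "M4": (7, 7), "M5": (4, 4)},
}
runs = shared_homozygous_runs(markers, affected)
print(runs)
```

Because this strict test requires every affected individual to share the run, it will miss intervals in which recombination or haplotype heterogeneity has broken the shared segment in some individuals, as noted in the text.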

6. Positional cloning

The ultimate purpose of linkage mapping is to define recombinant intervals sufficiently small to support direct molecular screening of DNA sequences for causal variants. It is beyond the scope of this chapter to discuss this process, positional cloning, in full detail. However, there is lively interest in significantly reducing the cost of DNA sequencing, and it is conceivable that a large-scale gene-screening approach could become cost-effective within the next few decades, which in theory could obviate the need for mapping (see Article 7, Single molecule array-based sequencing, Volume 3).

One can approximate the interval size equivalent to a given LOD score with the formula: size in cM = 60/LOD, such that a LOD score of 3.0 is on average equivalent to an interval of 20 cM (Sham, 1998). Even in the case of a monogenic disorder, with a well-defined chromosomal locus and constrained recombinant boundaries, it has historically been a significant challenge to identify causal variants. With the advent of the Human Genome Project, in the best case laboratories can simply resequence annotated genes within an interval (see Article 23, The technology tour de force of the Human Genome Project, Volume 3 and Article 24, The Human Genome Project, Volume 3). More typically, time and resources must still be devoted to clarifying gene content. Novel genes and new exons of incompletely defined gene transcripts still arise at a significant rate in positional cloning projects, although this should become less of an issue within the next several years.
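The rule of thumb quoted above (size in cM = 60/LOD) is easy to tabulate. The helper below is a minimal sketch; the function name is hypothetical, and the result is only the average approximation given in the text, not a confidence interval for any particular pedigree.

```python
def approx_interval_cm(lod):
    """Approximate interval size in cM for a given LOD score,
    using the rule of thumb: size in cM = 60 / LOD."""
    if lod <= 0:
        raise ValueError("the approximation applies only to positive LOD scores")
    return 60.0 / lod


sizes = {lod: approx_interval_cm(lod) for lod in (3.0, 6.0, 10.0)}
for lod, size in sizes.items():
    print(f"LOD {lod:4.1f} -> about {size:4.1f} cM")
```

Under this approximation, doubling the LOD score from 3.0 to 6.0 halves the expected interval from 20 cM to 10 cM.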

There are numerous approaches to mutation detection. Ultimately, the gold standard remains direct DNA sequencing, but several other physicochemical techniques (denaturing HPLC, mismatch scanning, chip-based resequencing, etc.) have been developed as alternatives for first-pass analysis. This remains a fast-moving field with technical improvements ongoing. Even for direct sequencing, software tools for more highly automated detection of sequence variants are still in development (Polyphred, Sequencher, Staden Tracediff, Softgenetics MutationSurveyor, etc.).

For linkage-based positional cloning, rare variants that result in obvious changes in gene function (frameshifts, stop codons, missense changes in conserved or biochemically validated amino acid residues, changes in conserved splice junction elements, etc.) are generally well accepted as causal for rare phenotypes, especially when they are absent or at vanishingly small frequencies in the general population (usually such a mutation is defined as undetected in at least 100 random control individuals). The overall mutation rate in the general population is such that for severe loss-of-function mutations there are usually many different allelic mutations detected for the same phenotype. Once a positional cloning project has led to provisional identification of a small number of mutations, follow-up validation of new mutations in additional patients often ensues rapidly. In cases where phenotypes are proposed to arise from unusual types of mutations, such as specific gain-of-function changes, validation may be more challenging.
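To give a sense of what screening 100 control individuals can and cannot exclude, the sketch below computes the chance that a variant of a given population allele frequency would be missed entirely. It assumes an autosomal variant and independently sampled chromosomes (two per control); the function name and the example frequencies are illustrative assumptions, not values from the chapter.

```python
def prob_absent_in_controls(allele_freq, n_controls):
    """Probability that an autosomal allele with the given population
    frequency is not observed on any of the 2 * n_controls chromosomes,
    assuming chromosomes are sampled independently."""
    return (1.0 - allele_freq) ** (2 * n_controls)


for freq in (0.05, 0.01, 0.001):
    p = prob_absent_in_controls(freq, 100)
    print(f"allele frequency {freq}: P(absent in 100 controls) = {p:.3f}")
```

Under these assumptions, a variant with a population frequency of 1% would still be absent from 100 controls roughly 13% of the time, so absence in a panel of this size argues against a common polymorphism but does not by itself establish that a variant is vanishingly rare.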

7. Future of linkage mapping

We believe that linkage mapping still plays an important role in disease gene identification. Of the proposed 25 000-30 000 human genes (Jaillon et al., 2004), fewer than 2000 have identified genetic variants associated with a clear phenotype in the OMIM database. For discovering the phenotypic effects of mutation in the remaining genes, especially for high-penetrance mutations such as severe loss-of-function, traditional family-based linkage analysis is still the most efficient technology available. Homozygosity mapping of recessive phenotypes in appropriate populations, through a hybrid linkage/LD approach, can be even more efficient in the gene discovery process.
