Genome mapping overview (Genomics)

1. Introduction

True physical maps of large-insert molecular clones have had a tremendous impact upon large-scale genomic DNA sequencing. One example is the fingerprint map, in which each clone is “fingerprinted” on the basis of the pattern of fragments generated by a restriction enzyme digest. The physical map of the human genome (McPherson et al., 2001) provided the necessary scaffolding for accurate final compilation of the human genomic DNA sequence. Computationally deconvoluting a whole-genome shotgun of small DNA fragments is challenging for any large genome, and more so for some genomes than for others. These final assembly steps are helped immeasurably by the availability of an ordered set of markers or molecular clones. Reference sets of ordered molecular clones have also influenced many other fields. For example, the use of mapped bacterial artificial chromosomes (BACs) for array-based comparative genomic hybridization (CGH) (Ishkanian et al., 2004) and the use of mapped BACs to identify chromosomal rearrangement breakpoints (Volik et al., 2003) are just two of the many mapping spin-offs that have become widely used tools for disease analysis and gene discovery.

2. Genetic maps

Genetic linkage maps have provided a key framework for all of the mammalian genomes that currently have published draft DNA sequence assemblies. Genetic maps were first constructed in the early twentieth century (Morgan, 1910) and preceded all other map construction methods. These early maps started with the meiotic mapping of chromosome landmarks, as defined by phenotypes such as eye color in flies. They next moved to biochemical markers such as blood antigens or isozymes and finally evolved into our current use of polymorphic DNA markers with multiple alleles. The most commonly used of these markers are simple sequence repeats (SSRs) and single-nucleotide polymorphisms (SNPs). Even today, the markers that initially anchored the first crude linkage maps sometimes still serve as the basis for defining quantitative trait loci (QTLs) in agriculturally relevant species and in organisms used as human disease models. The basic premise for building a genetic map from polymorphic DNA markers has not changed: a sufficient number of markers are typed in a defined population, and standard quantitative genetic formulas are used to place the markers into ordered linkage groups. The SSR-based linkage map has been the traditional choice for linkage analysis studies, whereas SNP maps are more commonly used in association studies (see review; Blangero, 2004). Linkage analysis attempts to define alleles within chromosomal regions that are identical by descent in a pedigree or set of pedigrees, using phenotypic data to infer this relationship. Association analysis, in contrast, does not necessarily rely on related individuals but instead tests whether the frequency of an observed genotype varies with the distribution of a phenotype across a large population. Both methods rely on linkage disequilibrium between the observed alleles and the quantitative trait of interest.
Several groups have successfully used a combination of linkage and association methods to link genotype to disease causation (Styrkarsdottir et al., 2003; Helms et al., 2003; Pajukanta et al., 2004).
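As an illustration of the kind of formula involved in placing markers on a linkage map, the following minimal Python sketch converts an observed recombination fraction between two markers into a map distance. The Haldane and Kosambi mapping functions are chosen here as familiar textbook examples; the text above does not name specific formulas, and the function names are hypothetical.

```python
import math


def recombination_fraction(recombinants: int, total: int) -> float:
    """Fraction of informative meioses that are recombinant between two markers."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    return recombinants / total


def haldane_cm(r: float) -> float:
    """Haldane map distance in centimorgans (assumes no crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in centimorgans (allows for some interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    # Illustrative counts only: 10 recombinants among 100 informative meioses.
    r = recombination_fraction(10, 100)
    print(f"r = {r:.2f}")
    print(f"Haldane: {haldane_cm(r):.2f} cM")
    print(f"Kosambi: {kosambi_cm(r):.2f} cM")
```

For small recombination fractions the two functions give similar distances, and they diverge as r approaches 0.5. Ordering many markers into linkage groups additionally requires multipoint analysis, which this sketch does not attempt.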

In the past decade, the development of large, model organism-specific SNP databases has provided the impetus to design whole-genome association studies to infer the existence of heritable QTLs. These studies may suffer from large stochastic variances in linkage disequilibrium and may thus require large numbers of SNPs; estimates range from 200 000 to 1 million (Carlson et al., 2004). Although these are daunting numbers, if costs and throughput continue on the current trend, we anticipate that the use of large whole-genome SNP panels will become a routine method for the discovery of candidate QTL or disease-associated regions, especially for complex diseases. The availability of a human haplotype map, which will catalog the more common haplotype blocks in humans (The International HapMap Consortium, 2003), will further augment these association studies by enabling us to design mapping studies on samples of different ethnic origin more efficiently. It appears inevitable that the construction of whole-genome genetic linkage maps will decline in nonmodel and non-food-producing organisms, and such maps are certainly not absolutely required to derive a finished genomic DNA sequence. However, it seems likely that they will continue to be used in defining the etiology of heritable traits in many plants and animals. It is also likely that the successful use of SNPs in association studies of complex human disease will require improvements in the collection and quantitation of many phenotypic parameters for very large sample sets (Bell, 2004). In the end, even association studies can only pinpoint a region, an SNP, or a set of SNPs that is associated with increased disease risk. Improvements in the targeted resequencing of large genomic regions will be necessary to uncover all of the DNA variation that exists in affected individuals.
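To make the idea of a single-SNP association test concrete, here is a minimal Python sketch of an allelic chi-square test on a 2x2 table of allele counts in cases and controls. This is one common textbook test, chosen for illustration; the studies cited above may use other statistics. The function names and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case = case_a + case_b
    row_ctrl = ctrl_a + ctrl_b
    col_a = case_a + ctrl_a
    col_b = case_b + ctrl_b
    if min(row_case, row_ctrl, col_a, col_b) == 0:
        raise ValueError("every row and column needs at least one count")
    # Shortcut for 2x2 tables: n * (ad - bc)^2 / (product of the four margins).
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    return numerator / (row_case * row_ctrl * col_a * col_b)


def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Made-up allele counts: allele A vs. allele B in cases and in controls.
    chi2 = allelic_chi_square(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60)
    print(f"chi-square = {chi2:.2f}, p = {p_value_1df(chi2):.4f}")
```

When hundreds of thousands of SNPs are tested, each p-value must be judged against a multiple-testing threshold. As a rough illustration, a Bonferroni correction at a family-wise level of 0.05 across 1 million tests would require p below 5e-8.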

3. Physical maps

Physical maps are relatively new and have advanced rapidly in the last decade owing to advances in clone manipulation and high-throughput automation. Physical mapping (the positioning of molecular clones along a genome) has been a central technology in deriving a finished genomic DNA sequence for many genomes. Most notably, the high-quality DNA sequence of the human genome and of many model organism genomes has greatly accelerated our ability to map a phenotype of interest to a defined chromosomal region. Many physical mapping techniques exist, and they fall into three general categories: (1) cytogenetic characterization, in which fluorescent in situ hybridization (FISH) is the most used method, whereby markers are localized along a defined chromosomal axis (Heng et al., 1997); (2) radiation hybrid mapping, in which hybrid cell lines randomly segregate chromosome fragments that can then be ordered by the PCR amplification of specific markers across a panel of cell lines (Cox, 1992); and (3) restriction mapping, which relies on the random distribution of restriction enzyme sites in a genome (Olson et al., 1986; Coulson et al., 1986; Marra et al., 1997).

Improvements in the resolution of FISH techniques have come from higher-resolution microscopy, more efficient dye labeling methods, and improved template preparation, such as interphase nuclei FISH and fiber-FISH (Florijn et al., 1995). Molecular cytogenetics has experienced a resurgence in recent years with the modification of methods such as spectral karyotyping (SKY) (Weimer et al., 2001) and the advent of comparative genomic hybridization (CGH) (Kallioniemi et al., 1992). These have had a particular impact upon the study of cancer (see review; Heng et al., 2004), but new methods such as array CGH (a fusion of CGH, large-insert genomic clones, and microarray technologies; Pinkel et al., 1998) are providing insights into copy number changes in many other complex diseases. Radiation hybrid (RH) maps, as briefly mentioned above, evolved from the need to improve marker resolution over genetic maps and to order markers along a chromosomal axis without relying on polymorphisms. The use of RH maps for anchoring sequences along a chromosome and constructing synteny relationships among species is now well documented (Mouse Genome Sequencing Consortium, 2002; International Human Genome Sequencing Consortium, 2001). Another method related to RH mapping is HAPPY mapping, although to date its use has been restricted to individual chromosomes rather than whole-genome maps (Thangavelu et al., 2003). In this case, the segregation of markers is conducted entirely in vitro by randomly breaking DNAs and subpartitioning the DNA into smaller samples. The frequency with which two markers are retained together in a sample depends on their physical distance: the closer they lie, the more often they are found in the same sample (as in RH maps). Thus, HAPPY mapping, like genetic maps and RH maps, relies upon statistical methods for marker ordering.
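The co-retention principle shared by RH and HAPPY mapping can be sketched with a toy calculation. In the Python example below, each marker is scored as present (1) or absent (0) across a panel of samples, and the fraction of samples in which two markers disagree serves as a crude proxy for the distance between them. This is a deliberately simplified assumption: real analyses use statistical models of breakage and retention, and the marker names and panel data here are invented.

```python
from itertools import combinations


def discordance(a: str, b: str) -> float:
    """Fraction of samples in which exactly one of the two markers is present."""
    if len(a) != len(b) or not a:
        raise ValueError("retention vectors must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical retention patterns: one character per sample, '1' = marker detected.
panel = {
    "M1": "1100110010",
    "M2": "1100110110",
    "M3": "0101100111",
}

if __name__ == "__main__":
    for m1, m2 in combinations(panel, 2):
        print(m1, m2, discordance(panel[m1], panel[m2]))
```

In this invented panel, M1 and M2 disagree least and M1 and M3 disagree most, which is consistent with the order M1, M2, M3.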

Maps have guided the genomic DNA sequence assemblies of many eukaryotic organisms. Sometimes an organism is so closely related to a species with an existing map that the organization of assembled sequences is relatively straightforward. One example is the relationship of chimpanzee sequence assemblies to the sequenced human genome. However, even in this case, when utilizing known syntenic and karyotypic relationships between the two species, care was exercised to avoid “humanizing” the chimpanzee sequence. In general, some form of framework map is required as a scaffold on which to assemble the final consensus sequence, and this map becomes very important when duplications or gaps must be resolved. High-quality physical maps require the ordering of reference markers along a definable linear track. Frequently, this track has consisted of restriction enzyme cleavage points. Two predominant techniques have been employed in these types of maps: clone-based restriction enzyme fingerprinting methods and, to a lesser extent, optical mapping of restriction fragments. Fingerprint maps have aided the accurate sequence assembly of the human, mouse, rat, and chicken genomic DNA sequences (Meyers et al., 2004; Wallis et al., 2004). In essence, they rely upon inferring large-insert clone overlaps and relationships by matching patterns in the lengths of various restriction enzyme digestion products. Optical maps, which depend upon stretching out DNAs and visualizing the length and order of various restriction enzyme digestion sites, have been helpful in quickly establishing genome order for relatively small genomes (Zhou et al., 2003) but suffer from the inability to actually archive a given stretch of DNA for further analysis.
Whatever mapping method is employed, the lessons to date indicate that for most large genomes, once a draft genomic DNA sequence is achieved, it is necessary to validate the assembly by reconciling marker orders with some form of reference physical map.
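The fingerprint comparison described above, in which clone overlaps are inferred from matching restriction fragment lengths, can be sketched as follows. This minimal Python example counts the fragments that two clones share within a size tolerance. The tolerance value, the function name, and the fragment sizes are assumptions made for illustration; production mapping software applies a statistical score to decide whether a given number of shared bands is significant, rather than relying on a raw count.

```python
def shared_bands(fp_a: list[int], fp_b: list[int], tolerance: int = 7) -> int:
    """Count fragments of fp_a that pair with a distinct fragment of fp_b.

    Two fragments are treated as the same band when their sizes differ by
    no more than `tolerance`. Each fragment is used in at most one pair.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # a[i] is too small to match anything remaining in b
        else:
            j += 1  # b[j] is too small to match anything remaining in a
    return matches


# Invented fragment sizes for three hypothetical clones.
clone_a = [512, 1030, 1540, 2200, 3100, 4800]
clone_b = [1035, 1538, 2204, 3600, 5100]
clone_c = [700, 900, 2900]

if __name__ == "__main__":
    print("A vs B:", shared_bands(clone_a, clone_b))
    print("A vs C:", shared_bands(clone_a, clone_c))
```

With these made-up fingerprints, clones A and B share three bands and are candidates for overlap, whereas A and C share none.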

4. Mapping informatics

For practitioners of map building and the use of maps in the genetic dissection of complex phenotypes, the main objective of informatics must be to present a great deal of information in a simple, easily visualized format. Typically, there are two definable stages associated with the positional cloning of candidate genes for a trait of interest. At the early stage, a framework map of some type is routinely required to navigate toward the variant allele or haplotype block. Once an interval is roughly defined, additional genetic mapping accelerates progress toward higher interval resolution and potential candidate gene isolation. There are now numerous algorithms and software choices for accomplishing this second phase for quantitative trait loci in experimental model organism crosses, human pedigrees, or human populations (see http://linkage.rockefeller.edu/soft/list.html for examples).

Comparative genomic analysis has become particularly important and powerful in defining genes and putative regulatory elements across species. There are now many such tools available for human and mouse genetics (many of which can be accessed through NCBI). One much favored entry point to these tools is the UCSC genome browser and its associated tools. However, many other software tools exist for the comparative evaluation of genetic and physical maps. For example, the package CMap (an extension of the Generic Model Organism Database (GMOD) project; http://www.gmod.org) was originally written for map integration of various plant species, but it has pushed investigators toward common platforms, thus allowing experiments to move quickly from an in silico analysis to the bench or to the field. Likewise, the availability of FPC software has been pivotal in advancing the construction of fingerprint maps in numerous species (e.g., Soderlund et al., 1997). Improvements to FPC and additional independent mapping software have moved the physical mapping process from labor-intensive to highly automated in a short period of time (see http://www.bcgsc.ca/bioinfo/software/ for software examples). For newly sequenced genomes, we expect that software development will continue to play a pivotal role in integrating map and sequence assemblies.

5. Conclusions

It is clear that maps have played a key part in assembling the linear order of sequenced genomes. They continue to provide the framework on which the genetic dissection of complex phenotypes is based. However, just like geographic map projections of the world, they can contain distortions and ambiguities. Among the causes are the duplications, rearrangements, and deletions that occur as common events throughout many genomes (including our own). Maps will continue to evolve and improve, in combination with overlaid information on transcription, regulation, and epigenetic levels of genomic control. It is hoped that the lessons of the past, in which even low-resolution framework maps played significant roles, will continue to be applied to future large-scale genomic DNA sequencing projects.