Fingerprint mapping (Genomics)

1. Introduction

Physical maps constructed from fingerprinted clones have been widely used in genomic research, for both genome-wide and region-specific analyses. As with other clone-based physical map construction strategies, one starts with a library of randomly arrayed clones, each clone containing an unknown fragment of DNA derived from the genome of interest, and experimentally identifies clone relationships that describe their proximity to one another in the intact genome. On the basis of these established relationships, an ordered set of overlapping clones representing the underlying genome is generated. In fingerprint map construction, these clone relationships are determined by comparing the characteristic patterns of DNA fragments generated by restriction digests of the cloned DNA (the clone “fingerprint”). Any two clones sharing a large fraction of their DNA are expected to have very similar fingerprint patterns. Therefore, by comparing the fingerprint patterns of all clones within a set of fingerprinted clones, those with significant similarity can be inferred as being derived from DNA from overlapping segments of the genome. The number and pattern of shared restriction fragments allow the clones to be ordered with respect to each other, thereby reconstructing contiguous regions of the genome. Because they are clone based, these maps provide a sequence-ready resource for genome sequencing efforts (see Article 8, Genome maps and their use in sequence assembly, Volume 7) and an entry point for cloning and functional analysis of genes of interest.
They represent and elucidate the underlying structure of the genome being studied and can be integrated with other genomic and genetic data, such as genetic markers and genomic or gene-based sequences, allowing correlation with whole-genome sequence assemblies as well as other types of genome maps, including cytogenetic maps, genetic linkage maps (see Article 15, Linkage mapping, Volume 3), radiation hybrid (RH) maps (see Article 14, The construction and use of radiation hybrid maps in genomic research, Volume 3), and sequence tag site (STS) maps (see Article 13, YAC-STS content mapping, Volume 3). For an overview of genome mapping, see Article 9, Genome mapping overview, Volume 3. The specific details of fingerprint map construction will be discussed here, beginning with a description of how this approach to physical mapping evolved.

2. Origins of fingerprint mapping

2.1. Molecular tools derived from basic research

Restriction mapping is the determination of the relative positions of restriction endonuclease sites along a DNA molecule. The concept of restriction mapping is therefore by definition contingent on the existence of restriction endonucleases. Thus, a critical step in the evolution of this mapping technique was the discovery, isolation, and characterization of these enzymes. Evidence of the existence of restriction enzymes was first observed in the early 1950s through the phenomenon of host-controlled variation, in which the ability of bacterial viruses to reproduce in certain host strains was dependent upon the host in which they had previously reproduced. This mechanism of host specificity in Escherichia coli was found to involve both DNA modification and DNA restriction activities. In 1968, a restriction enzyme from E. coli K, active against λ DNA, was the first restriction endonuclease to be highly purified and characterized. Purification and characterization of additional restriction enzymes rapidly followed (for early reviews, see Meselson et al., 1972; Nathans and Smith, 1975).

The potential of using restriction endonuclease digestion to characterize genomic DNA was first demonstrated in the early 1970s on SV40 DNA and the replicative form of φX174. These studies showed that specific viral DNA cleavage products could be generated by endonuclease digestion, that these products could be separated by polyacrylamide gel electrophoresis and individually identified, and that the number and size of the fragments produced could be used to characterize the viral DNA. The rationale behind these DNA cleavage experiments was twofold: (1) specific fragmentation of viral DNA chromosomes could potentially be used to generate small, unique DNA fragments that would be amenable to sequencing and (2) if such specific fragments could be produced, then the potential existed to order them with respect to each other and therefore provide a framework (i.e., a physical map) on which to map the location of specific genetic functions in the viral DNA. Indeed, several known φX174 activities had been successfully mapped to specific restriction fragments identified in the initial cleavage study. The DNA fragment patterns derived from restriction endonuclease digestion and electrophoretic separation (i.e., the fingerprint) were additionally found to be sufficiently sensitive and reproducible that they could be used to distinguish between different strains of SV40 (Nathans and Danna, 1972), in what may be the first comparative genome mapping experiment.

The first genome restriction map was generated for the SV40 genome (Danna et al., 1973), using partial DNA digestion with restriction endonuclease isolated from Haemophilus influenzae and subsequent complete digestion of the partial digest products with two additional restriction endonucleases. This resulted in a circular map composed of the relative positions of the cleavage sites within the DNA molecule. Using similar techniques, restriction maps for the genomes of a number of other small DNA viruses were also constructed, including those of the polyoma virus, λ, φX174, and adenovirus. A simple method for fragment separation on agarose gels and visualization using ethidium bromide was also developed during this time (Sharp et al., 1973). Restriction mapping of DNA molecules became a standard method for the direct characterization of small DNA chromosomes. The fundamental reagents and techniques required for fingerprint mapping had thus been established, using in part molecular tools that had been developed as a result of unrelated research into the mechanisms underlying bacterial host-pathogen interactions.

2.2. From viruses to humans: fingerprinting large genomes

The large size of bacterial and eukaryotic chromosomes, and the number and size of restriction digest fragments generated from these larger DNA molecules, made direct application of the restriction mapping techniques developed for the smaller viral DNA genomes problematic. Two primary technological advances provided the means to fingerprint map these large DNA molecules: the development of pulsed-field gel electrophoresis (PFGE) (Schwartz and Cantor, 1984) for the separation of large DNA fragments, and the development of recombinant DNA technology (Jackson et al., 1972; Cohen et al., 1973) to reduce large segments of genomic DNA into a number of smaller, more easily manipulated cloned genomic fragments.

These technologies led to the development of two approaches to fingerprint map large regions of DNA. In one method, described as a “top-down” or landmark mapping approach, intact genomic DNA (i.e., a whole chromosome) was digested with enzymes that cut rarely in the genome, generating large DNA fragments that were then separated by size on agarose gels by PFGE. These fragments were typically mapped relative to each other by hybridization of DNA probes, such as gene-based sequences or probes specific for restriction fragment ends. Because these restriction endonuclease recognition sites occur infrequently within the DNA sequence, this fingerprinting method generates a long-range but low-resolution “macrorestriction” map of the genome. Restriction maps for the genomes of E. coli (Smith et al., 1987), Saccharomyces cerevisiae (Link and Olson, 1991), and Schizosaccharomyces pombe (Fan et al., 1989) were generated using this approach. However, since this method requires the isolation of intact chromosomal DNA and the separation and detection of all fragments generated from a restriction digest of this DNA, it was not particularly well suited to the mapping of larger eukaryotic genomes. Additionally, it did not provide reagents that could be readily applied to functional studies or to sequencing strategies.

The second method utilized a “bottom-up” approach, in which many copies of the genome were first fragmented into smaller pieces of DNA, cloned into a bacterial vector and propagated in a suitable bacterial host. These smaller DNA fragments were easily isolated with standard molecular procedures and were amenable to restriction fingerprinting using the same general techniques applied to the viral DNA genomes. Thus, the restriction fingerprinting task was transformed from one applied to an entire eukaryotic chromosome to one applied to a series of easily manipulated DNA fragments. This approach was therefore better suited to high-throughput laboratory techniques than the top-down approach, with the additional benefit of providing higher resolution due to the increased density with which restriction sites could be sampled along the DNA. It does, however, represent a more complex task in terms of assembling a global fingerprint map from the individually fingerprinted DNAs (see Article 19, Restriction fragment fingerprinting software, Volume 3 and Article 1, Contig mapping and analysis, Volume 7). A variety of different strategies employing this basic approach have been used to construct fingerprint maps for eukaryotic genomes. The application of this methodology to the construction of whole-genome fingerprint maps was pioneered in the model organisms Caenorhabditis elegans (Coulson et al., 1986) and S. cerevisiae (Olson et al., 1986). The approach was soon employed in the generation of maps for other model organisms, including those of E. coli (Kohara et al., 1987; Knott et al., 1989), Arabidopsis thaliana (Hauge et al., 1991; Marra et al., 1999), and Drosophila melanogaster (Siden-Kiamos et al., 1990; Hoskins et al., 2000). Large regions of human chromosomes were also mapped with a variation of this approach (Carrano et al., 1989; Marra et al., 1997).
Ultimately, as the molecular and computational techniques employed in random-clone fingerprinting and map assembly matured, a clone-based fingerprint map for the entire human genome was achieved (McPherson et al., 2001). Fingerprinted clone maps have been constructed for a number of additional mammalian and plant species, including those for the laboratory mouse (Gregory et al., 2002), laboratory rat (Krzywinski et al., 2004), rice (Tao et al., 2001), and maize (Cone et al., 2002) genomes. These maps have played, and continue to play, important roles in genome sequencing efforts. For more information on the use of physical maps in genome projects, see Article 3, Hierarchical, ordered mapped large insert clone shotgun sequencing, Volume 3; Article 24, The Human Genome Project, Volume 3; and Article 8, Genome maps and their use in sequence assembly, Volume 7. The remainder of this review will discuss more specifically the current strategies used for fingerprint map construction that have evolved from the pioneering work of the past 30 years.

3. Fundamentals of fingerprint map construction

3.1. Overview of the fingerprinting process

The bottom-up approach for constructing fingerprint maps, also referred to as a contig-building strategy, can be divided into a fingerprint data generation (wet-lab) component and a contig construction (computational) component. The process is outlined in Figure 1, and encompasses the following basic steps: (1) construction of a large-insert clone library representing many copies of the genome, (2) DNA purification and restriction endonuclease digestion of a number of clones that together represent redundant coverage of the genome, (3) size separation of the restriction fragments by electrophoresis, (4) restriction fragment detection and size determination, (5) comparison of restriction fragment patterns between all clones to determine similarity, (6) assembly of clones with highly similar restriction fragment patterns into groups of ordered, overlapping clones (referred to as “contigs”), and (7) comparison of fingerprint patterns between clones at contig edges to identify moderate but still significant similarities, indicating joins between individual contigs, and thereby constructing larger contiguous regions of the genome. The end result of this process is a physical map represented by sets of ordered, overlapping clones. Depending on the fingerprinting technique used, the map may also reflect the underlying restriction fragment map of the genome. Assembly of the fingerprint data into contigs (steps 5-7, above) is performed with the assistance of a program called Fingerprint Contigs (FPC) (Soderlund et al., 1997; Soderlund et al., 2000). The computational aspects of using clone fingerprint data to assemble contigs are described elsewhere (see Article 19, Restriction fragment fingerprinting software, Volume 3 and Article 1, Contig mapping and analysis, Volume 7) and will not be covered here.

Figure 1 Overview of clone fingerprint data generation and map construction. The two components are shown, fingerprint generation on the left and map construction on the right. Fingerprint generation (a-d): (a) Generation of a large-insert clone library that represents the genome at a high level of redundancy. (b) Clones are sampled randomly from the library and digested with restriction endonuclease, here illustrated with the enzyme HindIII, with recognition sequence A|AGCTT. (c) Size separation of the restriction fragments by electrophoresis. Stylized data are depicted, with electrophoresis progressing from left to right. Top, chromatogram derived from fluorescently labeled fragments separated on automated sequencer; middle, fragments separated on an agarose gel and visualized with fluorescent DNA dye; bottom, actual restriction fragments. (d) Fragment detection and size determination. Each detected fragment is denoted with f_n, where n indicates a particular fragment size. Note that multiple fragments of the same size can be detected on agarose gels. Size determination is made by comparison and interpolation to the fragment pattern of an analytical marker (not shown), composed of DNA fragments of known size. Map construction (e-g): (e) Fingerprint data are stored as an ordered list of fragment sizes and/or mobilities for each clone (depicted here as a size ordered set of fragments). Comparison of fingerprints between all clone pairs is first performed to determine the similarity of fragment patterns. (f) Clones with highly similar fingerprint patterns (depicted on the right) are grouped into ordered sets representing overlapping clones (depicted on the left). Order within the contig is deduced by the progression of ordered fragments across fragment patterns (right, bottom).
(g) Map contiguity may be increased by subsequent comparison of fingerprint patterns between clones at contig edges (depicted on the left) to identify moderate but still significant similarities that can join contigs into a single structure (depicted on the right).

One might expect at the end of this contig-building process that each chromosome will be represented by a single contig; however, in practice, this is not achieved owing to the effect of a number of technical factors that may each contribute to varying degrees, including reduced representation (or lack of representation) in the library of certain genomic regions, lack of genome coverage as a result of the random sampling approach, and unrecognized clone overlap. These are discussed in more detail below.
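The pairwise comparison in step 5 of the process outlined above can be illustrated with a minimal Python sketch. It assumes that each fingerprint is stored as a list of fragment sizes in bp and that two fragments count as the same band when their sizes agree within a relative tolerance. The function names, the 1% tolerance, and the 0.5 similarity cutoff are hypothetical choices made for illustration; this is not the scoring scheme used by FPC.

```python
from itertools import combinations


def shared_fragments(fp_a, fp_b, tolerance=0.01):
    """Count fragments common to two fingerprints (lists of sizes in bp).

    Sizes are matched greedily in sorted order. Each fragment is used at
    most once, so bands of identical size (multiplets) are counted separately.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


def similarity(fp_a, fp_b, tolerance=0.01):
    """Shared fragments as a fraction of the smaller fingerprint."""
    return shared_fragments(fp_a, fp_b, tolerance) / min(len(fp_a), len(fp_b))


def candidate_overlaps(fingerprints, min_similarity=0.5, tolerance=0.01):
    """Compare all clone pairs and keep those at or above the cutoff."""
    pairs = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(sorted(fingerprints.items()), 2):
        score = similarity(fp_a, fp_b, tolerance)
        if score >= min_similarity:
            pairs.append((name_a, name_b, score))
    return pairs


# Made-up fingerprints: A and B share three bands, C is unrelated.
clones = {
    "A": [1000, 2500, 4000, 7200],
    "B": [2510, 4000, 7200, 9100],
    "C": [600, 1800, 3300, 5600],
}
overlaps = candidate_overlaps(clones)
```

In this toy data set only the pair A and B passes the cutoff. Ordering clones within a contig (step 6) requires additional logic that is not attempted here.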

3.2. Fingerprinting methods

There are two basic clone fingerprinting techniques that have evolved from the early work in C. elegans and S. cerevisiae, differentiated primarily by the method in which restriction fragments are separated and detected. In one method, restriction fragments are resolved by size on agarose gels and detected by staining with a sensitive DNA dye. In the other method, fragment separation is achieved using polyacrylamide electrophoresis and fragments detected via either radioactive or fluorescent labels.

The agarose gel-based technique (Marra et al., 1997; Schein et al., 2004) was developed from the method used to construct the S. cerevisiae fingerprint map, and was the first method to be widely applied to genome physical map construction. In this method, clone DNA is digested to completion, typically using a single enzyme with a 6-bp recognition site, and the fragments separated by electrophoresis on agarose gels. Analytical marker standards with known fragment sizes are loaded at frequent intervals along the gel to provide a sizing standard. Restriction fragments between approximately 600 and 30 000 bp can be resolved and reliably detected (Fuhrmann et al., 2003). Essentially, all restriction fragments generated from each clone (typically on the order of 23 fragments per 100 kb for a single enzyme digest) are detected with this method, providing the potential of deriving an ordered restriction map of each contig. In practice, however, the restriction map is only partially ordered, consisting of a series of fragment “bins”, each bin containing one or more fragments. The relative order of the bins is determined, but the order of multiple fragments within a bin is not.
The detection of all fragments and their sizes has several advantages: insert sizes can be determined individually for each fingerprinted clone, which can be a useful constraint when including end sequences of the BACs into a genome sequence assembly or when assessing BAC end sequence alignments to a genome sequence assembly; the estimated size of the overlap between any two clones can be calculated directly by summing the size of shared fragments detected, which has practical application when selecting from a contig a tiling set of clones, for example, a minimal tiling set of clones for sequencing or for representation on a genome array (see Article 16, Microarray comparative genome hybridization, Volume 3); verification of sequence assembly accuracy can be performed by comparison between experimental fingerprint fragments and an electronic digest of the corresponding sequence, which can be particularly useful in detecting collapses in the assembly due to the presence of repetitive sequences.
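The three uses of complete fragment data described above (insert sizing, overlap estimation, and comparison against an electronic digest) can be sketched in a few lines of Python. The sketch assumes a linear sequence, the HindIII site A|AGCTT shown in Figure 1, and fingerprints from which vector bands have already been removed; the function names and the 1% matching tolerance are hypothetical.

```python
def digest(sequence, site="AAGCTT", cut_offset=1):
    """Complete in-silico digest of a linear sequence.

    Returns fragment lengths in bp. cut_offset is the cut position within
    the recognition site (1 for HindIII, A|AGCTT). The site is assumed to
    be palindromic, so only the forward strand is searched.
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def insert_size(fragments):
    """Estimate insert size as the sum of all detected fragment sizes."""
    return sum(fragments)


def overlap_size(fp_a, fp_b, tolerance=0.01):
    """Estimate the overlap between two clones by summing shared fragments.

    Fragments are matched greedily in sorted order within a relative size
    tolerance; the size recorded for clone A is used for each shared band.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            total += a[i]
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return total


# Toy 20-bp sequence with two HindIII sites.
toy_sequence = "GG" + "AAGCTT" + "CCCC" + "AAGCTT" + "TT"
predicted = digest(toy_sequence)

# Made-up fingerprints sharing three bands.
estimated_overlap = overlap_size([1000, 2500, 4000, 7200], [2510, 4000, 7200, 9100])
```

Comparing a predicted fragment list of this kind with the experimental fingerprint of the same clone is one way to flag assembly problems: experimental bands with no predicted counterpart may point to sequence missing from the assembly, as in a collapsed repeat.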

The polyacrylamide-based fingerprinting techniques currently used were developed from the method used to construct the C. elegans fingerprint map. In this method, fragments are separated by electrophoresis on automated sequencers, either slab-gel based or, more commonly now, capillary based. Only those fragments that fall within a size range of approximately 70-500 bp are detected, and multiplets are not reliably detected. In order to generate a sufficient number of fragments within this size range, the DNA is digested with two or more enzymes. One of the enzymes cuts frequently within the genome and leaves a blunt end. The other enzymes typically have 6-bp recognition sites and leave an overhang. The vast majority of the resulting fragments have one blunt end and one end with an overhang, and detectable fragments represent approximately 15% of the clone DNA. The fragments are labeled at the 6-cutter end with one or more fluorescently labeled dideoxy nucleotides. There are several variations of this approach. In one method, a single 6-cutter enzyme is used, either Type II (Gregory et al., 1997) or Type IIs (Ding et al., 2001), and a single labeled nucleotide is added. The number of fragments detected is similar to that obtained with the agarose method. In an alternative for the latter approach, the overhang is fully sequenced (Ding et al., 2001), linking several bases of sequence information to each detected fragment. In a second method, four different 6-cutter enzymes are used, each labeled with a different fluorescent base (Luo et al., 2003), which adds restriction enzyme site information to the fragment size for each detected fragment. This method generates on the order of 78 fragments per 100 kb. One advantage of these methods over the agarose method is increased sizing accuracy, which is typically on the order of 1 bp.
The increased number of fragments and added information content of two of these methods also provides the possibility of detecting smaller clone overlaps than with the agarose-based method, which may result in greater map contiguity.

3.3. Factors affecting genome representation in clone libraries

Genomic clone libraries are typically constructed from genomic DNA that has been fragmented by partial restriction endonuclease digestion. The distribution of restriction enzyme recognition sites within a particular genome is therefore an important consideration prior to selection of an enzyme for use in library construction. If there exist regions in a genome where the distance between neighboring recognition sites for a particular enzyme is greater than the maximum fragment size that can be cloned, then these regions will not be represented in a genomic library constructed using that enzyme. If a single restriction endonuclease suitable for partial digestion of the DNA cannot be identified, then construction of two or more libraries, each generated using a different restriction enzyme, can compensate to some extent if the distribution of restriction sites for each enzyme within the genome is complementary (e.g., enzymes with different G/C content in their recognition sequences). Analysis of the fragment size range generated by a complete digestion of the genomic DNA with a candidate enzyme can indicate whether there are regions of the genome that will not be cloned. The size limit of cloneable fragments is of course dependent on the vector selected for library construction. Bacterial artificial chromosome (BAC) vectors (Shizuya et al., 1992) are currently the vectors of choice for constructing large-insert genomic libraries for purposes of restriction fingerprint mapping. BAC vectors are capable of cloning segments of foreign DNA of up to 300 kb, although insert sizes generally range from 100 to 200 kb. The cloned DNA is stably maintained, the rate of chimeric constructs is very low, and the clones are easily manipulated in the laboratory. However, there may be genomic sequences that are not readily cloned or easily propagated within bacterial hosts (e.g., heterochromatic DNA), and this can result in some bias in genome representation in a library.
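Where a reference sequence for the genome or a related region is available, the site-spacing check described above can also be approximated computationally. The following Python sketch is an illustration under stated assumptions: the sequence is linear, the recognition site is palindromic (so one strand is sufficient), and sequence ends are treated as interval boundaries. The function names and the size limit passed in are hypothetical.

```python
def site_positions(sequence, site):
    """Return the 0-based start positions of every occurrence of a site."""
    seq, site = sequence.upper(), site.upper()
    positions = []
    pos = seq.find(site)
    while pos != -1:
        positions.append(pos)
        pos = seq.find(site, pos + 1)
    return positions


def unclonable_intervals(sequence, site, max_insert):
    """List (start, end) intervals between neighboring sites longer than max_insert.

    Regions of this kind cannot be recovered as a single fragment from a
    partial digest with this enzyme, so they would be missing from the library.
    """
    bounds = [0] + site_positions(sequence, site) + [len(sequence)]
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if end - start > max_insert]


# Toy sequence: HindIII sites at 0, 56, and 72, with one 56-bp site-free stretch.
toy = "AAGCTT" + "C" * 50 + "AAGCTT" + "G" * 10 + "AAGCTT"
gaps = unclonable_intervals(toy, "AAGCTT", max_insert=30)
```

For a real BAC library, max_insert would be set near the upper cloning limit quoted above (up to about 300 kb), and running the same check for a second enzyme indicates whether two libraries would complement each other.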

3.4. Redundant genome sampling in a random-clone approach

In a random-clone fingerprinting strategy, clones from a genomic library are arrayed and sampled at random, with no a priori knowledge of where the clone inserts originated in the genome. Each successive clone that is sampled from a library may represent a completely unique region of the genome or it may overlap in whole or in part with one or more previously sampled clones. The first clones sampled from a library each have a high probability of representing a unique region of the genome, so the rate at which unrepresented regions of the genome are sampled with each additional clone is high. As the number of sampled clones increases, the probability decreases that each additional clone contains previously unsampled, unique DNA, and the rate at which unrepresented regions of the genome are sampled begins to decrease with each additional clone. In order to achieve complete, or nearly complete, representation of the genome in a random-clone approach, it is therefore necessary to sample many more clones (redundant sampling) than would be required to represent the genome if the clones were simply laid end to end.

The level of redundant sampling undertaken for a fingerprint mapping project is a function of the desired level of genome representation, the fraction of shared DNA between clones that is required to detect true clone overlaps (i.e., the sensitivity of overlap detection), and the relative number of contig gaps that is deemed acceptable. Given a truly nonbiased, randomly arrayed clone library, approximately fivefold genome redundancy (5X coverage) is necessary to provide substantially complete representation of a genome (Michiels et al., 1987). At fivefold redundancy, on average each nucleotide is represented in five different clones or, put another way, each clone overlaps on average with four other clones. This would roughly equate to 80% shared DNA between adjacent clones in the genome, a relatively substantial overlap. However, this is a calculated average, which means that half of the adjacent clone pairs will overlap by something less than 80%. Thus, for example, if 80% shared DNA is the minimum amount of overlap required to differentiate between true clone overlaps and false-positive overlaps during fingerprint contig assembly, half of the adjacent clone pairs in the genome will fail to satisfy this requirement. This will result in a large number of contig gaps in the assembly due to undetected clone overlaps. To minimize the number of contig gaps, the effective genome coverage in sampled clones must be increased to a depth that ensures that the majority of the genome is represented by adjacent clone pairs that overlap by the required amount.
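The relationship between coverage depth, genome representation, and detectable overlap can be made concrete with a small Python sketch. It rests on an idealization that is not taken from the text: clones of equal length whose start points fall uniformly at random along the genome (a Poisson model in the style of Lander and Waterman, 1988). The function names are hypothetical.

```python
import math


def fraction_covered(coverage):
    """Expected fraction of the genome present in at least one clone."""
    return 1.0 - math.exp(-coverage)


def clones_required(genome_size_bp, insert_size_bp, coverage):
    """Number of clones needed to reach a given fold coverage."""
    return math.ceil(coverage * genome_size_bp / insert_size_bp)


def fraction_detectable(coverage, min_overlap):
    """Fraction of adjacent clone pairs sharing at least min_overlap of their length.

    In the Poisson model, the spacing between consecutive clone starts is
    exponentially distributed with mean (clone length / coverage).
    """
    return 1.0 - math.exp(-coverage * (1.0 - min_overlap))


covered_5x = fraction_covered(5)                               # about 0.993
clones_15x = clones_required(3_000_000_000, 150_000, 15)
detectable_5x = fraction_detectable(5, 0.8)                    # about 0.63
detectable_15x = fraction_detectable(15, 0.8)                  # about 0.95
```

Under these assumptions, 5X coverage leaves well under 1% of the genome unsampled, yet about 37% of adjacent clone pairs fall short of an 80% overlap requirement, which is of the same order as the rough figure of one half given above. Raising the depth to 15X reduces that share to about 5%. The 3-Gb genome and 150-kb insert size are illustrative values only.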

3.5. Clone overlap detection and contig gaps

For any particular fingerprinting project, the level of redundant clone coverage required is dependent on both the size of the genome and the sensitivity of detection of clone overlap, the latter of which is based on fingerprint similarity and is a function of clone size and fingerprinting technique. Clone overlap is essentially calculated as the relative proportion of common fragments shared between two clone fingerprints. Since larger genomes require more clones to represent them than do smaller genomes, the probability that there are two unrelated clones sharing by chance a certain number of fragments of the same size is also increased. Thus, as the size of the genome increases, the likelihood of detecting false-positive overlaps given a particular requirement for clone similarity also increases. The required amount of calculated overlap between two clones that is accepted as representing true overlap for purposes of contig construction must therefore be increased for large genomes relative to smaller genomes, and this will affect the level of redundant coverage selected. Mathematical descriptions and analyses of the effects of these factors have been published (Lander and Waterman, 1988; Branscomb et al., 1990). For fingerprint maps of mammalian-sized genomes, a number of clones representing 10-15X genome coverage are typically fingerprinted.
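The effect of genome size on false-positive overlaps can be illustrated with a deliberately simplified binomial model in Python. The assumptions are stated loudly because none of them comes from the text: every clone has the same number of fragments, each fragment of one clone matches some fragment of an unrelated clone by chance with a fixed probability p_match, and matches are independent. This is not the statistic used by FPC, and the function names and numerical values are hypothetical.

```python
from math import comb


def chance_overlap_probability(n_fragments, min_shared, p_match):
    """Probability that two unrelated clones share at least min_shared fragments.

    Upper tail of a binomial distribution with n_fragments trials and
    per-fragment chance-match probability p_match (an assumed value).
    """
    return sum(
        comb(n_fragments, k) * p_match**k * (1 - p_match) ** (n_fragments - k)
        for k in range(min_shared, n_fragments + 1)
    )


def expected_false_pairs(n_clones, n_fragments, min_shared, p_match):
    """Expected number of unrelated clone pairs that pass the overlap cutoff."""
    n_pairs = n_clones * (n_clones - 1) // 2
    return n_pairs * chance_overlap_probability(n_fragments, min_shared, p_match)


# Same cutoff (15 of 30 fragments), tenfold more clones.
small_genome = expected_false_pairs(30_000, 30, 15, 0.1)
large_genome = expected_false_pairs(300_000, 30, 15, 0.1)

# Stricter cutoff (20 of 30 fragments) for the larger clone set.
large_genome_strict = expected_false_pairs(300_000, 30, 20, 0.1)
```

Because the number of clone pairs grows with the square of the number of clones, a tenfold larger clone set yields roughly a hundredfold more chance matches at the same cutoff. Raising the required number of shared fragments brings the count back down, at the cost of missing true overlaps that are smaller than the new cutoff.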