Hierarchical, ordered mapped large insert clone shotgun sequencing (Genomics)

The underlying concept of employing dideoxynucleotides as chain terminators in the DNA sequencing reaction, to create a replicated nested fragment set that is size fractionated and detected, has changed little since Sanger et al. first reported it in 1977 (Sanger et al., 1977). In contrast, the detailed laboratory methods used to create, resolve, and detect the actual dideoxynucleotide sequence data have improved greatly owing to the discovery and use of improved DNA polymerases (Chien et al., 1976; Saiki et al., 1988), the development of automated electrophoresis instrumentation (Ansorge and Barker, 1984; Smith et al., 1986; Karger et al., 1991), and the availability of highly sensitive fluorescently labeled nucleic acid derivatives that can be detected automatically and efficiently after laser excitation (Bauer, 1990). Although all of these methodological improvements were important, it was the introduction of commercially available automated DNA sequencing instruments, together with a massive influx of public and private sector funding (Choudhuri, 2003) over the last decade, that paved the way for the almost log-scale yearly increases in the amount of DNA sequence data collected and a parallel significant reduction in DNA sequencing cost. As a result, a paradigm shift occurred that placed more emphasis on approaches and methods to generate and assemble large target DNA sequences than on the actual DNA sequencing data collection process. Clearly, the latter remains important, as improvements continue to be made through the introduction of newer DNA sequencing instruments, several of which are described in this section, as well as through significant improvements in DNA sequence assembly programs, as described in other chapters. However, as with almost all science, changes evolve slowly over time.
This was the case for DNA sequencing, which began with directed sequencing of restriction-digested and subcloned short target sequences, subsequently evolved into a hierarchical map-based approach to sequence larger genomic regions, and then into shotgun sequencing of the large-insert clones that make up an ordered minimal tiling path, combined with more directed closure and finishing phases. These methods have now evolved into the widespread use of whole-genome shotgun sequencing and assembly to order and orient contiguous but gapped sequences, without much attention to the closure and finishing of the entire genome.


Initially, as the DNA sequencing data collection technologies evolved over the past decade, several groups also focused on developing strategies for obtaining the target DNA that subsequently was subjected to the sequencing process. Here the genomic DNA was cleaved by either enzymatic or physical methods, and shotgun libraries were produced using various host/vector systems. Cosmid and yeast artificial chromosome (YAC) vectors (McCormick et al., 1987; Burke and Olson, 1991) initially were employed for this purpose, and hybridization methods were used to determine which cosmid or YAC clone(s) encoded the target region of interest (Feinberg and Vogelstein, 1983). When multiple adjacent probes were used, created either as PCR products amplified off end-sequenced or fully sequenced cosmids (termed overgos) or from sequencing fragmented YACs, it also was possible to overlap the cosmids and/or YACs and to generate a tiling path covering regions of genomic DNA several orders of magnitude larger than that covered by the initial cosmid or YAC. Although valuable, using these hybridization approaches to completely sequence a large genomic region by building a tiling path from a large number of target clones was both time consuming and often prone to errors that could be traced to the specificity of the hybridization probe used. In addition, both YAC and cosmid vector systems tended either to lose portions of the inserted DNA or to otherwise rearrange it, since there was little selective pressure to accurately maintain the originally cloned genomic DNA fragment. Therefore, more stable host/vector systems were developed, namely bacterial artificial chromosome (BAC)-based clones that could contain between 100 000 and 200 000 bp of genomic DNA insert (Shizuya et al., 1992) and fosmid-based cloning vectors that typically contained ~40 000 bp of inserted genomic DNA (Kim et al., 1992).
Since both types of clone libraries were engineered so that they were much less prone to deletions or rearrangements, improved methods to generate tiling paths for large segments of genomic DNA were now possible.

Thus, the hierarchical map-based approach needed to complete the sequence of large reference genomes, for example, flies, worms, humans, and mice, necessitated the development of BAC fingerprinting methods to create a tiling path of overlapping individual clones, from which a smaller, minimal set of BAC clones then could be selected for eventual sequencing. Initially, these physical maps were constructed using high-throughput polyacrylamide gel electrophoresis to separate the restriction enzyme-digested BAC clone DNA, which then was visualized using a fluorimager; the band values and gel traces subsequently were normalized by editing the digitized images (Sulston et al., 1989). More recently, capillary electrophoresis of fluorescent-labeled DNA restriction digests has resulted in a more automated process by which thousands of BACs from a library can be rapidly fingerprinted (Ding et al., 1999; Ding et al., 2001). In either case, the resulting visualized and normalized restriction digestion patterns then are compared and overlapped via computer-based methods such as FPC (Marra et al., 1997), in which the clones are ordered into tiling paths on the basis of the occurrence of shared bands.
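
The idea of ordering clones by shared bands can be illustrated with a minimal Python sketch. It is not the algorithm used by FPC or any other published program; it assumes that each fingerprint is simply a set of integer band sizes, that two bands match when their sizes differ by no more than a hypothetical `tolerance`, and that a greedy walk from a chosen starting clone is good enough for a toy example. The clone names and band sizes are invented for illustration.

```python
def shared_bands(a, b, tolerance=0):
    """Count bands in fingerprint a that match a distinct band in b.

    Two bands match when their sizes differ by at most `tolerance`
    (an assumed, simplified matching rule).
    """
    remaining = sorted(b)
    count = 0
    for band in sorted(a):
        for i, other in enumerate(remaining):
            if abs(band - other) <= tolerance:
                count += 1
                del remaining[i]  # each band in b may be matched only once
                break
    return count


def order_clones(fingerprints, start, tolerance=0):
    """Greedily extend an ordering from `start`, always adding the unplaced
    clone that shares the most bands with the clone placed last."""
    order = [start]
    unplaced = set(fingerprints) - {start}
    while unplaced:
        last = fingerprints[order[-1]]
        best = max(
            sorted(unplaced),  # sorted so that ties are broken reproducibly
            key=lambda name: shared_bands(last, fingerprints[name], tolerance),
        )
        if shared_bands(last, fingerprints[best], tolerance) == 0:
            break  # no detectable overlap: the contig ends here
        order.append(best)
        unplaced.remove(best)
    return order


# Hypothetical fingerprints: band sizes (bp) from a restriction digest.
fingerprints = {
    "cloneA": {100, 200, 300, 400},
    "cloneB": {300, 400, 500, 600},
    "cloneC": {500, 600, 700, 800},
    "cloneD": {700, 800, 900},
}

print(order_clones(fingerprints, "cloneA"))
```

Real fingerprint data are noisier than this toy set, which is why production tools score overlaps statistically rather than by a raw shared-band count.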

Once a minimum tiling path is obtained, the DNA from the underlying BAC clones is isolated and subjected to shotgun sequencing. This process entails breaking a large target DNA randomly into smaller fragments that then are cloned into a vector. Initially, M13 phage vectors (Messing et al., 1977) were used for this purpose, but today double-stranded pUC-based plasmid vectors (Vieira and Messing, 1982) are used almost exclusively, as both ends of the cloned insert can be sequenced more easily from the plasmid than from the single-stranded phage vector. After end sequencing, overlapping identical sequences are assembled to recreate the original sequence of the BAC-cloned insert. This process is analogous to reconstituting the front page of a daily newspaper by putting thousands of copies of it through a shredder and then overlaying the pieces with similar words and pictures to give a single copy of the initial page.
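
The assembly step described above can be sketched in a few lines of Python. This is a deliberately simplified greedy overlap-merge procedure, not the method of any particular assembly program: it assumes error-free reads, all from the same strand, with no repeated sequence, and it uses a hypothetical `min_overlap` threshold. The target sequence and reads are invented for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`,
    or 0 if no overlap of at least `min_overlap` bases exists."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap
    until no pair overlaps by at least `min_overlap` bases."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining contigs do not overlap: gaps remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical 20-bp target and four overlapping "reads" taken from it.
target = "ATGGCGTACCTTAGCAGTCA"
reads = [target[10:18], target[0:8], target[14:20], target[5:13]]

print(greedy_assemble(reads))
```

With real data, sequencing errors, repeats, and reads from both strands mean that the greedy choice can be wrong, which is one reason directed closure and finishing phases follow the shotgun phase.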

The initial description of shotgun cloning was given by Steve Anderson in 1981, when he described the cloning of the products of a partial DNase I digestion of a 4257-bp target fragment of the bovine mitochondrial genome into M13 vectors (Anderson, 1981), followed by randomly picking subclones and obtaining the end sequence of each of them. The resulting overlapping sequences then could be assembled into a final, contiguous, consensus sequence representing that of the initial DNA target fragment. This shotgun technique took several years to become widely accepted because the high number of DNA sequencing reactions, and of the resulting polyacrylamide gel-generated sequences that had to be read manually, made it both too expensive and too labor intensive. It was not until almost a decade later that two independent laboratories, Lee Hood's group at Caltech (Smith et al., 1986) and Ansorge's group at the EMBL laboratory in Heidelberg, Germany (Voss et al., 1990), introduced automated DNA sequence data collection methods that resulted in the first commercially available fluorescent-based DNA sequencers, produced by Applied Biosystems and Pharmacia, respectively, in the early 1990s. The major advantage of these fluorescent-based DNA sequencing instruments was that the data collection process was automated. However, since the fluorescent-labeled reactions produced a weaker signal than the radioactive-labeled reactions, they required higher amounts of single-stranded DNA template and fluorescent-labeled primer to produce the required signal strength during a constant-temperature incubation. The later introduction of thermostable DNA polymerases that allowed reaction temperature cycling, termed "cycle sequencing" (Murray, 1989), and of fluorescent-labeled dideoxynucleotide terminators eventually made it possible to use much less DNA template in a single reaction.
This, when coupled with the automated data collection on slab gel-equipped instruments, ensured that the shotgun sequencing approach truly became widely accepted. More recently, the introduction of capillary-based DNA sequence data collection instruments by Applied Biosystems, Molecular Dynamics, and Beckman, which have shorter run times than the previous slab gel-based machines and load samples automatically, resulted in the elimination of the labor-intensive sequencing reaction pipetting and data collection steps.

This chapter includes descriptions of the work of several groups that have sequenced large numbers of DNAs from both higher eukaryotes and microbial genomes, as well as a discussion of sequencing template preparation methods (see Article 4, Sequencing templates – shotgun clone isolation versus amplification approaches, Volume 3) and a description of robotics and automation techniques (see Article 5, Robotics and automation, Volume 3). These articles are followed by contributions from three of the leading groups developing the next generation of high-throughput DNA sequencing methods, which include microelectrophoresis devices for DNA sequencing (see Article 6, Microelectrophoresis devices for DNA sequencing, Volume 3), single molecule array-based sequencing (see Article 7, Single molecule array-based sequencing, Volume 3), and real-time DNA sequencing.
