Eukaryotic genomics

1. Introduction

In the early 1990s, a plethora of strategies for genome sequencing were proposed as part of the initial phases of the Human Genome Project (HGP). Each strategy relied to varying extents on three types of maps: (1) genetic and radiation hybrid maps consist of sequence tagged site (STS) markers of known order throughout the genome that can be used as landmarks (see Article 14, The construction and use of radiation hybrid maps in genomic research, Volume 3 and Article 15, Linkage mapping, Volume 3), (2) physical maps are composed of overlapping cloned regions of the genome that are tied to the landmark maps and can be used as the physical source of DNA for sequencing a segment of the genome (see Article 9, Genome mapping overview, Volume 3 and Article 18, Fingerprint mapping, Volume 3), and (3) the sequence map, which is the genome sequence itself. So many factors underlie these interconnected maps that the initial plan for the HGP chose to defer decision making on many aspects of the latter two maps until they could be developed together as technology for collecting DNA sequence improved and lessons had been learned from model organism sequencing projects.

By the end of the 1990s, genome sequences were in hand for Saccharomyces cerevisiae (Goffeau et al., 1996; Goffeau et al., 1997), Caenorhabditis elegans (C. elegans Consortium, 1998), and several bacteria and archaea. Sequencing technology had improved dramatically, and a few large laboratories were running scores of automated DNA sequencers and producing thousands of high-quality DNA sequence lanes per day. With the introduction of 96-channel capillary sequencers in 1998, the rush was on to capitalize on even greater sequencing capacity to get the human genome sequence done. Celera Genomics and the International Human Genome Sequencing Consortium chose two different strategies, whole-genome shotgun and hierarchical shotgun, respectively. These two strategies and several blends between them continue to be used for the sequencing of other eukaryotic genomes. A brief analysis of the “whole-genome shotgun” and “hierarchical shotgun” methods will provide an introduction to this section on Genome Sequencing.

2. Sequencing and assembly approaches

2.1. Hierarchical shotgun

Traditional genome-sequencing methods (Blattner et al., 1997; Goffeau et al., 1997; C. elegans Consortium, 1998) have relied on making a carefully constructed map of genome subclones, sequencing each subclone, and then reassembling the complete genome by piecing together the subclone sequences. The maps are generally constructed by a combination of marker-driven methods (probing a subclone library with short sequences such as STSs) and fingerprint methods (restriction digest patterns of clones are compared to one another to identify overlapping clones). A set of subclones is then chosen for sequencing by selecting the smallest number of clones that reliably covers the genome. The advantage of this approach, termed “hierarchical shotgun sequencing”, is that there are several opportunities for checking the quality of the map as it progresses (i.e., do the fingerprint and marker order data agree?). A second advantage is that the map itself has value, since the clones can be used for follow-up study. The disadvantage is that it is laborious and difficult to automate. It is also highly dependent on the nature of the subclones that are used. Early map-building efforts suffered from clones that were unstable at both ends of the size spectrum: yeast artificial chromosomes (YACs, 0.5-2 Mb) and cosmids (30 kb). Bacterial artificial chromosomes (BACs, ~150 kb), which are carried as single copy in Escherichia coli, have proven to stably clone most segments of the human genome, and the hierarchical shotgun strategy relied to a large extent on BACs for sequencing the human genome. Each BAC was then subcloned into small (~2 kb) fragments that were sequenced. Assembly of each BAC was followed by a second assembly step that joined all of the BAC sequences together on the basis of information from the map, thus the term “hierarchical” shotgun assembly. The BAC clones have also served as a distributed platform for finishing the human genome sequence to high quality, with different clones completed and checked at dozens of laboratories throughout the world (IHGSC, 2004).
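
The clone-selection step described above (choosing the smallest set of clones that covers the genome) can be illustrated with a greedy interval cover. This is only a minimal sketch under simplifying assumptions: clone positions are taken as already known map coordinates, the clone names and coordinates are hypothetical, and no minimum overlap between neighboring clones is enforced, which a real tiling-path selection would normally require.

```python
from typing import List, Tuple

# Hypothetical clone record: (name, start, end) in map coordinates (e.g. kb).
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[str]:
    """Greedy interval cover: repeatedly pick, among the clones that start at or
    before the current position, the one that extends furthest to the right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[str] = []
    pos = region_start
    i = 0
    while pos < region_end:
        best = None
        # Consume every clone that begins at or before the covered position.
        while i < len(ordered) and ordered[i][1] <= pos:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= pos:
            # No clone extends the covered region: the map has a gap here.
            raise ValueError(f"map gap at position {pos}")
        path.append(best[0])
        pos = best[2]
    return path


# Illustrative (made-up) BAC map covering positions 0..450.
example_clones = [
    ("BAC_A", 0, 150),
    ("BAC_B", 100, 260),
    ("BAC_C", 120, 300),
    ("BAC_D", 250, 420),
    ("BAC_E", 290, 450),
]
print(minimal_tiling_path(example_clones, 0, 450))
```

The greedy choice is sufficient for the smallest covering set on a single linear region; the error raised on an uncovered position corresponds to a gap in the physical map.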

2.2. Whole-genome shotgun

With the development of the Applied Biosystems 3700 Automated DNA Analyzer, the speed and accuracy with which raw DNA sequence data could be obtained increased dramatically. This forced a shift in thinking away from the map-based approaches toward a whole-genome shotgun strategy that would take maximum advantage of the increased output of raw sequence data. The whole-genome strategy relies on computational algorithms rather than extensive map-building to reassemble the genome sequence from the raw data (Weber and Myers, 1997). In the whole-genome shotgun strategy, the entire genome is sheared into small to medium fragments (~2 kb or ~10 kb); these are subcloned and sequenced directly. Because both ends of each subcloned fragment are sequenced, the two end sequences are constrained to lie a known approximate distance apart in the genome; these clone end sequences are called mate pairs. A sufficient number of fragments are sequenced to represent the genome 5-10 times. This 5X to 10X coverage means that most DNA bases have been sequenced many times, but a small fraction is missing because the coverage is random. The whole-genome strategy poses two problems: the sheer number of fragments (about 30 million for 5X coverage of the human genome) and the presence of repetitive DNA. The first problem is largely computational: data management, data structures, and assembly algorithms have been developed to effectively organize and handle the quantity of data (Myers et al., 2000; Batzoglou et al., 2002). The second problem is more complicated and has implications for sequencing all eukaryotic genomes.
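
The coverage figures in this paragraph can be checked with a short calculation. The sketch below assumes a read length of 500 bases and uses the idealized Poisson model of random shotgun coverage (often associated with Lander and Waterman), in which the expected fraction of bases covered by no read is e^(-c) at coverage c. Neither the model nor the function names come from the text; they are assumptions made for illustration.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Number of reads required so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation: probability that a given base is hit by zero reads."""
    return math.exp(-coverage)


GENOME_SIZE_BP = 2_900_000_000  # human genome size used in the text
READ_LENGTH_BP = 500            # assumed read length

for c in (5, 10):
    n = reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP, c)
    missing = expected_uncovered_fraction(c)
    print(f"{c}X coverage: {n:,} reads, ~{missing:.4%} of bases expected uncovered")
```

Under these assumptions, 5X coverage corresponds to roughly 29 million reads, in line with the figure of about 30 million quoted above. The model ignores cloning bias and repeats, so real projects leave more of the genome uncovered than this estimate suggests.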

If the 2.9 billion base-pair sequence of the human genome were composed of a random distribution of the four DNA bases, any given sequence of roughly 16 or more bases (4^16, or about 4.3 billion possible sequences) would be likely to be unique in the genome. The 500 to 700 bases in a typical sequence fragment from an automated sequencer have far more than enough information content to be unique in the human genome. The problem then is not the size of the genome but the presence of highly similar sequences at more than one location in the genome. There are several categories of these repeated sequences, ranging from the ~300-bp Alu element (about 1 000 000 copies) to duplications of up to several megabases that are frequent around the centromeres of the chromosomes (see Article 26, Segmental duplications and the human genome, Volume 3). The length and sequence identity of the repetitive elements determine the level of difficulty that they add to the assembly process. Repeats that are longer than the typical sequence read length and more similar than about 98.5% along their entire length are difficult to assign to their correct chromosomal location.
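
The uniqueness argument for a random genome can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a uniform, independent distribution of the four bases and considers one strand only, so the expected number of positions matching a given sequence of length k is about G / 4^k for a genome of G bases. The function names are illustrative and not taken from any library.

```python
def expected_occurrences(genome_size_bp: int, k: int) -> float:
    """Expected number of positions matching a fixed k-mer in a random genome
    (uniform bases, one strand, edge effects ignored)."""
    return genome_size_bp / 4 ** k


def smallest_unique_length(genome_size_bp: int, threshold: float = 1.0) -> int:
    """Smallest k for which the expected number of matches falls below `threshold`."""
    k = 1
    while expected_occurrences(genome_size_bp, k) >= threshold:
        k += 1
    return k


GENOME_SIZE_BP = 2_900_000_000  # human genome size used in the text

for k in (12, 16, 20, 24):
    print(f"k={k}: expected matches ~ {expected_occurrences(GENOME_SIZE_BP, k):.3g}")

print("shortest length with < 1 expected match:", smallest_unique_length(GENOME_SIZE_BP))
print("shortest length with < 0.01 expected matches:",
      smallest_unique_length(GENOME_SIZE_BP, threshold=0.01))
```

A read of 500 bases is far beyond any of these thresholds, which is why assembly difficulty comes from repeated sequence rather than from genome size.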

By obtaining mate-pair sequence from clones of several insert sizes, it is often possible to identify unique sequence that jumps across or spans repeats that are shorter than the average clone length. This approach can resolve the most common types of repetitive elements in the human genome. The mate pairs also serve to anchor together adjacent sequence contigs, resulting in long chains of correctly ordered sequence, with the gaps between contigs spanned by subclones. Additional computational techniques have been developed that attempt to improve on assembly in repeat-rich regions, especially relying on detection and classification of repeats, use of error-correction, and use of signature differences to separate repeat copies appropriately. Long tandem arrays of nearly identical repeats at the centromeres and telomeres of chromosomes cannot be sequenced with existing technology and approaches.
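
The way mate pairs anchor adjacent contigs can be sketched as a simple link count. The example assumes that each read has already been placed in a contig, that read and contig names are hypothetical, and that a join is trusted only when at least two independent mate pairs support it. That threshold is an assumption for illustration, and the sketch ignores read orientation and insert-size checks that a real scaffolder would apply.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_contig_links(
    mate_pairs: Iterable[Tuple[str, str]],
    read_to_contig: Dict[str, str],
) -> Counter:
    """Count mate pairs whose two reads fall in different contigs."""
    links: Counter = Counter()
    for left_read, right_read in mate_pairs:
        a = read_to_contig.get(left_read)
        b = read_to_contig.get(right_read)
        # Skip pairs with an unplaced read, or with both reads in one contig.
        if a is None or b is None or a == b:
            continue
        links[tuple(sorted((a, b)))] += 1
    return links


def supported_joins(links: Counter, min_links: int = 2) -> List[Tuple[str, str]]:
    """Contig pairs connected by at least `min_links` mate pairs."""
    return sorted(pair for pair, n in links.items() if n >= min_links)


# Illustrative (made-up) read placements and mate pairs.
placements = {
    "r1f": "contig1", "r1r": "contig2",
    "r2f": "contig1", "r2r": "contig2",
    "r3f": "contig2", "r3r": "contig3",
    "r4f": "contig1", "r4r": "contig1",
}
pairs = [("r1f", "r1r"), ("r2f", "r2r"), ("r3f", "r3r"), ("r4f", "r4r"), ("r5f", "r5r")]

links = count_contig_links(pairs, placements)
print(supported_joins(links))
```

Requiring more than one supporting pair guards against chimeric clones and misplaced reads, at the cost of leaving weakly supported joins unresolved.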

3. Prospects for the future

Sequencing of the human genome was a landmark event in the history of science. While a great deal was learned about the structure and content of the genome through an initial evaluation of the sequence, it has become increasingly clear that much more can be learned by comparing the genomes of multiple individuals and comparing the human genome to that of other primates, mammals, and other animals. The genomes of yeast, C. elegans (nematode worm), and Drosophila melanogaster (fruit fly) were obtained as part of the preparation for sequencing the human genome. Following human, the mouse, rat, and chimpanzee genomes have been completed to a “draft” stage. A “draft” genome sequence generally means that about 95% of the genome is covered in reasonably accurate sequence (less than one error in 5000 bases) that is well ordered and mapped to chromosomes. Many additional eukaryotic genome-sequencing projects are either completed, underway, or planned (see Table 1).

Table 1 Eukaryotic genome-sequencing projects

Species                             Status          Genome size (Mb)  Sequencing strategy (a)
Human                               Finished        2900              HS, WGS
Mouse                               Draft+Finished  2600              WGS
Rat                                 Draft           2700              Hybrid
C. elegans                          Finished        100               HS
C. briggsae                         Draft           105               WGS
D. melanogaster                     Finished        120               WGS
A. gambiae                          Draft           280               WGS
P. falciparum                       Finished        23                SCS
S. cerevisiae                       Finished        16                HS
S. pombe                            Finished        12.5              HS
Dog                                 Light Draft     2700              WGS
Arabidopsis                         Finished        125               HS
Rice                                Draft           400               Hybrid
Neurospora                          Draft           40                WGS
Fugu rubripes                       Draft           365               WGS
C. intestinalis                     Draft           117               WGS
Danio rerio (Zebrafish)             Draft+Finished  1600              Hybrid
Cow                                 Ongoing         2900              Hybrid
Honeybee                            Ongoing         200               Hybrid
D. pseudoobscura                    Draft           140               WGS
Chimpanzee                          Draft           2900              WGS
Macaca mulatta                      Ongoing         2900              Hybrid
C. albicans                         Draft           16                WGS
Fusarium graminearum                Draft           36                WGS
Ustilago maydis                     Draft           19                WGS
Ciona savignyi                      Draft           180               WGS
Aspergillus nidulans                Draft           30                WGS
Aspergillus fumigatus               Draft           35                WGS
Magnaporthe grisea                  Draft           39                WGS
Coprinus cinereus                   Draft           36                WGS
Cryptococcus neoformans serotype A  Draft           19                WGS
Dictyostelium discoideum            Draft           34                SCS
Entamoeba histolytica               Draft           20                WGS
Tetrahymena thermophila             Draft           100               WGS
Theileria parva                     Draft+Finished  9                 WGS
Brugia malayi                       Ongoing         110               WGS
Plasmodium vivax                    Draft           30                WGS
Plasmodium yoelii                   Draft           30                WGS
Pneumocystis carinii                Ongoing         7                 WGS
Schistosoma mansoni                 Draft           270               WGS
Toxoplasma gondii                   Ongoing         80                WGS
Trichomonas vaginalis               Ongoing         60                WGS
Trypanosoma brucei                  Ongoing         30                Hybrid
Trypanosoma cruzi                   Ongoing         40                Hybrid
Sea urchin                          Draft           800               Hybrid
Tetraodon nigroviridis              Ongoing         400               WGS
Kangaroo                            Ongoing         -                 WGS
Chicken                             Ongoing         1200              Hybrid

(a) HS: hierarchical shotgun; WGS: whole-genome shotgun; Hybrid: both whole-genome shotgun and map-based clone sequencing used together; SCS: single chromosome shotgun.

Given what has been learned so far, what is the best strategy for sequencing additional large eukaryotic genomes? The choice of sequencing strategy for these organisms will depend on the goals of the sequencing project and on the answers to three primary questions: (1) How closely related is the genome to the genome of another organism that has been sequenced? (2) What is the nature of the repetitive elements in the genome? (3) Will the genome eventually be finished to very high quality? Each of these issues will be addressed in the following paragraphs.

3.1. Comparative sequencing

Increasingly, phylogenetic relatives are being sequenced to assist in the analysis and interpretation of a reference genome sequence (see Article 48, Comparative sequencing of vertebrate genomes, Volume 3). Drosophila pseudoobscura, C. briggsae, and chimpanzee were all selected for sequencing not only on their own merits but also for what a comparison of their sequence might reveal about better-studied close relatives. In the case of chimpanzee, the nucleotide identity is so high that virtually every sequence read from the chimp genome can be assigned to a unique corresponding region of the human genome sequence, with the exception of sequences that are chimpanzee-specific. Drosophila pseudoobscura and C. briggsae are more distantly related to their respective references (D. melanogaster and C. elegans), at about the same phylogenetic distance as human and mouse. At this distance, most of the nonfunctional sequence is no longer conserved, which facilitates identification of genes and conserved regulatory elements. The primary goal of these projects is to identify matching regions in a reference genome, and secondarily to identify the sequence unique to each genome, rather than to construct a high-quality finished sequence. In this case, a whole-genome shotgun strategy is clearly the most efficient way to generate high-quality draft sequence for comparison.

3.2. Repetitive elements

When long, nearly identical repeats are present, and when it is important to correctly resolve those repeat structures, such as for the study of chromosome evolution, a hierarchical or hybrid approach is likely to be the most effective. Whole-genome shotgun data can indicate the presence of repeated sequences (based on excess sequence coverage at those locations), but physically separating each copy of each repeat in BAC clones is the best way of correctly assembling each copy of long identical repeats. In repeat-rich genomes, construction of a BAC map by use of restriction fingerprint patterns can also be quite challenging, necessitating additional laboratory work to confirm both map and sequence.
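
The idea that excess shotgun coverage signals a collapsed repeat can be illustrated with a simple depth scan. This is a hedged sketch, not a description of any particular assembler: it assumes a per-position read depth has already been computed for a contig, uses the median depth as the single-copy baseline, and flags runs where depth exceeds an arbitrary multiple (here 1.8) of that baseline.

```python
from statistics import median
from typing import List, Tuple


def flag_excess_coverage(depths: List[int], factor: float = 1.8) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals where read depth exceeds
    `factor` times the median depth, a possible sign of a collapsed repeat."""
    cutoff = factor * median(depths)
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, depth in enumerate(depths):
        if depth > cutoff and start is None:
            start = i  # a high-coverage run begins
        elif depth <= cutoff and start is not None:
            intervals.append((start, i))  # the run ended at the previous position
            start = None
    if start is not None:
        intervals.append((start, len(depths)))
    return intervals


# Illustrative (made-up) depth profile along a short contig.
example_depths = [8, 7, 9, 8, 17, 18, 16, 8, 7, 9, 24, 25]
print(flag_excess_coverage(example_depths))
```

A scan of this kind only indicates that extra copies exist; as the text notes, separating the individual copies still requires physical resolution, for example in BAC clones.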

3.3. Gap closure

Genome finishing, the process of filling gaps and confirming the quality of the entire sequence, is a quite different task from collecting the initial sequence data for a project, regardless of whether a hierarchical or whole-genome strategy is used. Plasmid subclones from each BAC or from the whole-genome library used in the initial sequencing phase are selected for additional sequencing if they span a gap in a contig or a low-quality region. Additional finishing techniques involve sequencing of PCR-amplified segments of the genome and direct sequencing of BAC clones. One of the most difficult challenges of genome finishing is closing gaps where no cloned DNA is present. These so-called physical gaps (so named because the missing sequence is not physically present in any of the clone libraries) often result from portions of the genome that are not clonable in the standard cloning vectors that propagate in E. coli. For small genomes, a combination of sequencing subclones from whole-genome shotgun libraries, direct BAC or genomic sequencing, and PCR has been very successful at achieving high-quality genomic sequence. For larger metazoan genomes, where the whole-genome libraries contain millions of clones, finishing has primarily been performed on a BAC-by-BAC basis.

The cost per base pair for genome finishing is easily 50 times the cost of producing the first ~95% of the sequence in draft form. The high cost and technical complexity of producing a finished genome sequence mean that there will be many more draft than completely finished genome sequences for the foreseeable future. Methods such as comparative gene-finding programs (Parra et al., 2003; Flicek et al., 2003) that take best advantage of the incomplete information present in draft genome sequences continue to evolve.

4. Conclusion

As the cost of DNA sequencing continues to decline and analytical methods for assembling, annotating, and interpreting genome sequence improve, it is clear that more eukaryotic genomes will be sequenced. In fact, more than three dozen projects are already well along (Table 1) and many more are planned. The wealth of genome sequence data that will result will prove quite powerful for assisting in understanding the evolution of metazoan species, the structure of chromosomes, the sets of functional genes, and the sequences that control their expression.
