Comparative analysis for mapping and sequence assembly (Bioinformatics)

1. Comparative analysis for mapping and sequence assembly

Comparisons of mammalian genomic sequences reveal extensive similarity at both the chromosome and basepair levels (Mouse Genome Sequencing Consortium, 2002; Rat Genome Sequencing Project Consortium, 2004). The increasing number of assembled reference sequences produced by ongoing genome sequencing projects thus provides information that is potentially useful for the mapping and assembly of related genomes.

Detection of unique orthologous sequence fragments, which are sometimes referred to as “orthologous anchors” or “syntenic anchors”, is a key step in many comparative genomic methods (Mural et al., 2002). The density of anchor sequences and the conservation of their relative order, orientation, and distances are key indicators of the utility of comparative information for mapping and assembling one organism using the assembled sequence of a related organism as a template.
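As a rough illustration of these indicators, the following sketch computes anchor density and a simple measure of order conservation for a set of anchors. The function name, the input layout, and the particular order measure (the fraction of adjacent anchors that keep the same relative order in the reference) are assumptions made for this example and are not taken from the cited work; orientation and inter-anchor distances are ignored.

```python
def anchor_summary(anchors, query_length):
    """Summarize a set of anchors between a query and a reference sequence.

    anchors: list of (query_pos, ref_pos) pairs (hypothetical layout).
    query_length: length of the query sequence in base pairs.
    Returns (anchors per megabase, fraction of adjacent anchors in order).
    """
    ordered = sorted(anchors)  # sort by position in the query
    density = len(ordered) / (query_length / 1_000_000)
    adjacent = list(zip(ordered, ordered[1:]))
    # An adjacent pair is collinear if the reference position also increases.
    collinear = sum(1 for (_, r1), (_, r2) in adjacent if r2 > r1)
    order_conservation = collinear / len(adjacent) if adjacent else 0.0
    return density, order_conservation


example = [(100, 1000), (200, 2000), (300, 1500), (400, 3000), (500, 4000)]
density, conservation = anchor_summary(example, query_length=2_000_000)
```

A practical measure would also need to treat inverted segments as conserved rather than as a run of out-of-order pairs.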

2. Detection of orthologous anchors

Orthologous anchors may be identified by comparing two assembled genomes, by comparing one assembled genome against an extensive collection of contigs from another genome, or by comparing contigs from two partially assembled genomes. (The word “contig” here refers either to a group of contiguous assembled sequencing reads or to a single read.)

All-against-all sequence comparison of two genome-scale databases is computationally demanding because of its comprehensive nature. In return, it allows identification of truly orthologous anchors with high specificity via the application of a reciprocal best match filter (Mural et al., 2002): a pair of sequences is retained only if each is the best match of the other.
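The reciprocal best match filter can be sketched as follows. This is a minimal illustration, assuming that hits are available as (sequence A, sequence B, score) tuples with higher scores being better; the function name and input layout are hypothetical and do not reproduce any of the cited programs.

```python
def reciprocal_best_matches(hits):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit.

    hits: iterable of (seq_a, seq_b, score) tuples; higher score is better.
    Ties are broken in favor of the hit seen first (a simplification).
    """
    best_for_a = {}  # seq_a -> (score, seq_b)
    best_for_b = {}  # seq_b -> (score, seq_a)
    for a, b, score in hits:
        if a not in best_for_a or score > best_for_a[a][0]:
            best_for_a[a] = (score, b)
        if b not in best_for_b or score > best_for_b[b][0]:
            best_for_b[b] = (score, a)
    return sorted(
        (a, b) for a, (_, b) in best_for_a.items() if best_for_b[b][1] == a
    )


hits = [
    ("h1", "m1", 90), ("h1", "m2", 70),
    ("h2", "m2", 80), ("h3", "m2", 85), ("h3", "m3", 60),
]
anchors = reciprocal_best_matches(hits)  # [("h1", "m1"), ("h3", "m2")]
```

Because both best-hit tables must be complete before any pair can be accepted, the filter needs hits from the whole of both databases, which is why it fits all-against-all comparison. A stricter variant could discard tied best hits instead of keeping the first one.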

The most widely used comparison programs are optimized for fast and sensitive comparison of short queries against a single large database (Altschul et al., 1997; Pearson, 2000). In contrast, for maximum utility of a “reciprocal best match” filter, anchoring ideally requires simultaneous comparison of genome-sized databases. A new generation of comparison programs employs various types of genome-scale indices that are designed to fit within computer memory (RAM), which provides the speed required for genome comparison (Ning et al., 2001; Kent, 2002; Ma et al., 2002; Kalafus et al., 2004). One of these programs was specifically designed for the anchoring task (Kalafus et al., 2004).

3. Comparative mapping of reads and contigs

Anchoring of contigs from one species onto an assembled genome of the other related species provides hypothetical order, orientation, and distance information. This information can be independently confirmed by a variety of methods. For example, mate-pair reads obtained from both ends of clone inserts of known size may bridge the gap between two anchored contigs, thus confirming their putative distance and orientation. Anchored contigs may also be bridged computationally from the read overlap graph by constructing a tiling path of overlapping reads. Alternatively, the bridging may be performed by targeted experiments, such as PCR and primer walking.
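The computational bridging step can be illustrated with a breadth-first search over the read overlap graph. The sketch below assumes that overlaps have already been computed and are given as unordered pairs of read or contig identifiers; the function name and identifiers are hypothetical, and the search returns the path with the fewest reads, which is only one possible choice of tiling path.

```python
from collections import deque


def tiling_path(overlaps, start, goal):
    """Return the shortest chain of overlapping reads from start to goal.

    overlaps: iterable of (id_a, id_b) pairs of overlapping reads or contigs.
    Returns a list of identifiers, or None if no chain connects them.
    """
    graph = {}
    for a, b in overlaps:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        read = queue.popleft()
        if read == goal:
            path = []
            while read is not None:  # walk back to the start
                path.append(read)
                read = previous[read]
            return path[::-1]
        for neighbor in sorted(graph.get(read, ())):
            if neighbor not in previous:
                previous[neighbor] = read
                queue.append(neighbor)
    return None


overlaps = [("contigA", "r1"), ("r1", "r2"), ("r2", "contigB"), ("r1", "r3")]
path = tiling_path(overlaps, "contigA", "contigB")
```

In practice the path would also have to be checked against the distance and orientation implied by the anchors, since repeats can create spurious overlaps.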

There are two general contig mapping scenarios. The first scenario involves one assembled genome and the contig set from another, partially assembled genome (Pletcher et al., 2001). The second scenario involves two genomes assembled at the contig level only, each providing only fragmentary information for bootstrapping the other’s assembly. The algorithmic challenge inherent in the latter scenario has recently been addressed (Veeramachaneni et al., 2003).

4. Comparative mapping of clones

A variety of methods exist for mapping cloned genomic fragments, such as Bacterial Artificial Chromosomes (BACs), of one species onto an already assembled genome of another. The resulting map may be used to select clones for targeted comparative sequencing, to infer their hypothetical order, overlap, and distance, or both.

One method for comparative physical mapping starts with a comparison of already assembled genomes to identify regions generally conserved across species. The conserved regions are then used to design hybridization probes for identification of homologous clones from a third species. The key assumption is that sequence conservation between two species is a good indicator of conservation in the third, related species. The method has been applied to the construction of orthologous clone-contig maps in multiple species (Thomas et al., 2002).

A more direct probe design requires whole-genome shotgun reads from the mapped species. Reads anchored onto an assembled genomic sequence of a related species can be used as intraspecies hybridization probes to identify clones that map onto the anchoring sites. A large collection of mouse BACs has been mapped onto the human genome by this method (Thomas et al., 2000).

An even more direct comparative clone mapping method involves sequencing of clone ends and their anchoring onto the related genome. A clone is mapped by this method if the distance between the two anchoring sites approximates the clone insert size. Recent large-scale applications of this method involved BAC end sequencing (BES) of large collections of chimpanzee (Fujiyama et al., 2002) and bovine (Larkin et al., 2003) BACs and their mapping onto the assembled human genome.
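The acceptance criterion for a clone mapped by its end sequences can be written as a simple check. The sketch below is illustrative only: the function name, the tuple layout of an anchored end, and the 25% tolerance are assumptions chosen for the example, not values from the cited studies.

```python
def clone_is_mapped(end1, end2, insert_size, tolerance=0.25):
    """Decide whether two anchored clone ends are consistent with one clone.

    end1, end2: (chromosome, position) of each anchored end sequence.
    insert_size: expected clone insert size in base pairs.
    tolerance: allowed relative deviation from insert_size (assumed value).
    """
    chrom1, pos1 = end1
    chrom2, pos2 = end2
    if chrom1 != chrom2:
        return False  # a distance is only defined on one chromosome
    span = abs(pos2 - pos1)
    return abs(span - insert_size) <= tolerance * insert_size


ok = clone_is_mapped(("chr1", 1_000_000), ("chr1", 1_160_000), 170_000)
```

A fuller check would presumably also require the two ends to anchor in opposite orientations, and would allow for insertions and deletions between the two species when setting the tolerance.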

Finally, the recently proposed Pooled Genomic Indexing (PGI) method (Milosavljevic et al., 2005; Csuros and Milosavljevic, 2004) maps BACs of one species onto the genome of another by shotgun sequencing of BAC pools. Pools are designed so that any two pools have at most one BAC in common. Consequently, if reads originating from two pools anchor within the same BAC-sized genomic location of a related genome, the BAC that is present in both pools is mapped onto that location. PGI has the potential to significantly reduce the cost and increase the efficiency of comparative mapping. In a first genome-scale application of PGI, a library of rhesus macaque BACs is being mapped onto the human genome (Milosavljevic et al., 2005).
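The deconvolution logic of PGI can be sketched in a few lines. This is a simplified illustration and not the published algorithm: pool contents are assumed to be given as sets of BAC identifiers, anchored reads as (pool, chromosome, position) tuples, and every pair of reads is examined, which is quadratic in the number of reads. All names are hypothetical.

```python
from itertools import combinations


def pgi_map(pools, anchored_reads, bac_size):
    """Map BACs from pairs of pooled reads anchored close together.

    pools: dict pool_id -> set of BAC ids; any two pools share at most one.
    anchored_reads: list of (pool_id, chromosome, position) tuples.
    bac_size: maximum distance between reads from the same BAC.
    Returns a sorted list of (bac, chromosome, start, end) tuples.
    """
    mapped = set()
    for read1, read2 in combinations(anchored_reads, 2):
        pool1, chrom1, pos1 = read1
        pool2, chrom2, pos2 = read2
        if pool1 == pool2 or chrom1 != chrom2:
            continue
        if abs(pos1 - pos2) > bac_size:
            continue
        shared = pools[pool1] & pools[pool2]
        if len(shared) == 1:  # the unique BAC present in both pools
            (bac,) = shared
            mapped.add((bac, chrom1, min(pos1, pos2), max(pos1, pos2)))
    return sorted(mapped)


# A 2 x 2 array of BACs pooled by row and by column.
pools = {
    "row1": {"A", "B"}, "row2": {"C", "D"},
    "col1": {"A", "C"}, "col2": {"B", "D"},
}
reads = [("row1", "chr2", 500_000), ("col1", "chr2", 560_000)]
result = pgi_map(pools, reads, bac_size=200_000)
```

In this toy array, reads from pools row1 and col1 anchoring within one BAC length of each other identify BAC A, the only clone the two pools share.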

5. Comparative sequence assembly

The assembled genomic sequence of one species can be used as a template to guide the assembly of another, related species. In contrast to independent assembly, where only read overlap information is utilized in the assembly process, comparative assembly maximizes both read overlap and similarity of the newly assembled sequence to a reference template (Milosavljevic, 1999). An information-theoretic argument shows that the comparative information enables better detection of conserved regions than the comparison of independently assembled genomes (Milosavljevic, 1995).

Comparison of independently assembled genomic sequences frequently reveals differences that are due either to evolutionary rearrangements or to assembly errors. Detection of the assembly errors leads to an improved assembly (Pletcher et al., 2001; Rat Genome Sequencing Project Consortium, 2004). To prevent assembly errors from occurring in the first place, comparative information may be employed earlier in the assembly process. Ideally, the comparative information is employed so as to simultaneously improve both the extent and quality of assembly.

Selection of an appropriate comparative assembly strategy depends on specific circumstances. For example, if the genome in question has been covered by shotgun reads, comparative information may be employed in order to localize the assembly process. Specifically, some of the reads are first directly anchored onto their orthologous locations in the reference genome. The anchored reads are then used as “baits” to “fish out” some of the remaining nonanchored reads using intraspecies read overlap information, thus increasing the number of localized reads. Finally, the localized reads are locally assembled. The localized assembly process improves the extent and quality of assembly at low read coverage, particularly in the presence of repetitive elements.
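The bait-and-fish step can be sketched as a bounded expansion over read overlaps. The sketch below assumes that anchoring has already assigned some reads to named reference regions and that pairwise overlaps are known; the function name, the region labels, and the `rounds` parameter are hypothetical choices for this illustration.

```python
def localize_reads(anchored, overlaps, rounds=1):
    """Assign nonanchored reads to the region of an overlapping bait read.

    anchored: dict read_id -> reference region for directly anchored reads.
    overlaps: iterable of (read_a, read_b) pairs of overlapping reads.
    rounds: how many times newly caught reads are reused as baits.
    Returns dict read_id -> region for anchored plus caught reads.
    """
    neighbors = {}
    for a, b in overlaps:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    localized = dict(anchored)
    baits = set(anchored)
    for _ in range(rounds):
        caught = {}
        for bait in sorted(baits):
            for read in sorted(neighbors.get(bait, ())):
                if read not in localized and read not in caught:
                    caught[read] = localized[bait]  # inherit the bait's region
        localized.update(caught)
        baits = set(caught)  # only new reads fish in the next round
    return localized


anchored = {"r1": "region_A", "r5": "region_B"}
overlaps = [("r1", "r2"), ("r2", "r3"), ("r5", "r6"), ("r7", "r8")]
localized = localize_reads(anchored, overlaps, rounds=1)
```

Reads grouped by region would then be passed to a local assembler. A read that overlaps baits from two different regions, as can happen within repeats, is assigned here to the first region encountered, which a real pipeline would need to handle more carefully.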

The human genomic sequence has been used as a reference for the initial assembly of the genomes of the mouse and dog (Abbott, 2000; Kirkness et al., 2003). A published NIH report anticipates that primate genome sequencing projects will be greatly aided by the availability of the finished human genome (NIH-NCRR, 2001). A comprehensive pipeline for comparative assembly of bacterial strains has recently been developed (Pop, 2004).

6. Comparative assembly of transcribed sequences

Expressed sequence tags (ESTs) obtained from sequencing cDNA clones can be assembled into transcript sequences using an assembled genomic sequence as a reference. In addition to their inherent biological significance, assembled transcript sequences have utility for the design of gene expression probes. Specifically, human EST fragments can be grouped using human genomic sequence and then locally assembled for the purpose of designing oligonucleotide probes for the analysis of gene expression.
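The grouping step can be illustrated by clustering ESTs whose genomic alignments overlap. This is a minimal sketch under the assumption that each EST has a single alignment interval on the genome; the function name and input layout are hypothetical, and spliced alignments and strand are ignored.

```python
def group_ests(alignments):
    """Group ESTs whose genomic alignment intervals overlap.

    alignments: iterable of (est_id, chromosome, start, end) tuples.
    Returns a list of groups, each a list of EST ids in genomic order.
    """
    groups = []
    current, current_chrom, current_end = [], None, None
    for est_id, chrom, start, end in sorted(
        alignments, key=lambda a: (a[1], a[2], a[3])
    ):
        if current and chrom == current_chrom and start <= current_end:
            current.append(est_id)  # overlaps the running interval
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(current)
            current, current_chrom, current_end = [est_id], chrom, end
    if current:
        groups.append(current)
    return groups


alignments = [
    ("est3", "chr1", 5000, 5600), ("est1", "chr1", 100, 600),
    ("est2", "chr1", 500, 900), ("est4", "chr2", 200, 700),
]
groups = group_ests(alignments)  # [["est1", "est2"], ["est3"], ["est4"]]
```

Each group would then be assembled locally into a transcript sequence from which probes can be designed.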

7. Significant trends in sequencing technology

In contrast to the current applications of comparative sequence assembly, which are driven by the availability of assembled genomes, early applications were driven by necessity: short sequence fragments detected by hybridization technology could not be assembled independently and comparative assembly provided the solution (Milosavljevic, 1995; Milosavljevic, 1999). With the advent of DNA chip technologies, comparative sequencing using hybridization probes is now expanding to the genome scale (Frazer et al., 2001; Frazer et al., 2003).

New sequencing technologies have the potential to increase the throughput and decrease the cost of sequencing by orders of magnitude. These trends may expand the applications of sequencing in the same way that the increased performance and decreased cost of microprocessors expanded the applications of computing. One obstacle to wide adoption of the new technologies, such as pyrosequencing (Ronaghi, 2001), single molecule array-based sequencing (see Article 7, Single molecule array-based sequencing, Volume 3), massively parallel genome sequencing, and real-time DNA sequencing (see Article 8, Real-time DNA sequencing, Volume 3), is their higher sequencing error rate and significantly shorter read length compared to the standard Sanger method. Such sequence fragments tend to be too short for independent assembly, particularly in view of the repeat-rich nature of mammalian genomes (see Article 2, Algorithmic challenges in mammalian whole-genome assembly, Volume 7), but they are long enough for the sort of anchored comparative mapping and assembly described above.

8. Summary

Comparative information has the potential both to decrease the cost of mapping and sequencing projects and to accelerate them by reducing experimental effort. Various comparative mapping and sequencing methods have already been put into practice. Future demand for comparative mapping and assembly is likely to be driven by two trends: the increasing availability of reference genomes as potential templates for assembly and advances in sequencing technology. If current trends continue, comparative assembly of individual human genomes and even diagnostic sequencing of individual tumor samples will become routine practice.

A number of challenges will have to be overcome along the way to efficient gigabase-scale sequence assembly. One is the development of computational means for fast and accurate sequence anchoring based on all-against-all comparisons of exponentially increasing sequence databases. Another significant challenge is the development of comparative mapping and assembly methods and their systematic characterization and validation.