Algorithmic challenges in mammalian whole-genome assembly (Bioinformatics)

1. Introduction

The basic methodology used today for sequencing a large genome is double-barreled shotgun sequencing. Shotgun sequencing, introduced by Sanger and colleagues early on (Sanger et al., 1977), involves obtaining a redundant representation of a genomic segment with sequenced reads and then assembling the reads into contigs on the basis of sequence overlap. Double-barreled shotgun sequencing enhances this method by introducing the notion of paired-end reads that are obtained by sequencing both ends of a fragment of known size, and thus provides useful long-range linking information (Edwards et al., 1990).

It is relatively straightforward to cover an arbitrarily long genome with sequenced reads so that few gaps remain (e.g., by the Lander-Waterman model (Lander and Waterman, 1988), a 10-fold redundant coverage of the genome with 500 bp reads yields fewer than one gap in every 10^6 bp). Sequencing a genome is simple, then, if defined as obtaining a representation of the majority of nucleotides in a genome. What makes sequencing challenging is the subsequent computational problem of correctly assembling the reads into the original sequence.
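As a quick sanity check of this figure, the following minimal Python sketch computes the Lander-Waterman gap rate (the function name and parameter values are ours, chosen to match the example above):

```python
import math

def expected_gaps_per_bp(coverage: float, read_len: int) -> float:
    """Lander-Waterman estimate: read start points form a Poisson
    process, so with N = coverage * G / read_len reads the expected
    number of gaps is N * exp(-coverage); dividing by the genome
    length G gives (coverage / read_len) * exp(-coverage) gaps/bp."""
    return (coverage / read_len) * math.exp(-coverage)

# The example from the text: 10-fold coverage with 500 bp reads.
rate = expected_gaps_per_bp(coverage=10.0, read_len=500)
print(f"{rate:.1e} gaps/bp, i.e. about one gap per {1 / rate:,.0f} bp")
# ~9.1e-07 gaps/bp: fewer than one gap per 10^6 bp, as claimed.
```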

2. Repeats, sequencing errors, and their relationship

Assembly would be easy were it not for repeats. If genomes had the properties of random strings over the four letters {A, C, G, T}, then it would be unlikely for two reads to share a significant overlap unless they represented overlapping segments in the genome (a true overlap). Unfortunately, nearly half of a mammalian genome's length consists of repeats, and true overlaps can be confused with repeat-induced overlaps. The main challenge of assembly is to distinguish between true overlaps, where reads should be merged, and repeat-induced overlaps, where merging should (usually) be avoided to prevent misassemblies: assembled sequence that does not occur in the genome. Repeats come in many varieties. The number of copies ranges from 2, as in gene duplications, to about 10^6 in the case of Alu repeats in the human genome. Sequence similarity between copies can be almost 100%, as in recent duplications, or very low, as in ancient transposon lineages that have become inactive. Our ability to distinguish between true and repeat-induced overlaps is intimately tied to the sequencing error rate. In essence, if a proportion of sequencing errors greater than r is unlikely, then any overlap with more than 2r mismatches and gaps is likely to be repeat-induced. Therefore, increasing read accuracy reduces the effective repeat content of a genome for the purposes of assembly.
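This 2r rule is easy to state in code. The sketch below is a hypothetical illustration, not any assembler's actual filter; the names and the 1% error threshold are ours:

```python
def likely_repeat_induced(diffs: int, overlap_len: int, r: float) -> bool:
    """The 2r rule from the text: if each read has error rate at most r,
    two reads from the same locus should disagree in at most ~2r of
    their overlapping positions; a noisier overlap is probably
    repeat-induced and should not be merged."""
    return diffs / overlap_len > 2 * r

# Hypothetical example: 15 differences in a 300 bp overlap, with reads
# assumed to have at most 1% sequencing error.
print(likely_repeat_induced(15, 300, r=0.01))  # True: 5% >> 2%
```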


3. Two main ways to deal with repeats: clustering the reads and pairing the reads

There are broadly two methods for assembling long genomic regions in the presence of repeats. The first method is clone-based sequencing, whereby the reads are clustered by performing shotgun sequencing on medium-sized segments such as Bacterial Artificial Chromosome (BAC) clones of length ~200 kb. With this method, the effective repeat content is drastically cut, because most segments that are repeated across the genome are unique within such short regions. Moreover, the number of reads to be assembled is reduced by a factor of 10^4, and therefore the computational assembly problem becomes much simpler. The second method is double-barreled whole-genome shotgun (WGS) sequencing, whereby paired reads are obtained from both ends of inserts of various sizes (Edwards et al., 1990). Paired reads can resolve repeats by jumping across them: when taken from plasmid inserts longer than 5000 bp, they can resolve LINE repeats of that length; paired reads from longer inserts such as cosmids and BACs can resolve duplications, help build longer scaffolds, and verify the large-scale accuracy of an assembly.
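As a rough illustration of this "jumping" argument, the sketch below captures the simplified condition that an insert can span a repeat when it is longer than the repeat (names are ours; real assemblers also model read length and insert-size variance):

```python
def pair_jumps_repeat(insert_size: int, repeat_len: int) -> bool:
    """Simplified condition: a read pair can jump across a repeat when
    the insert is longer than the repeat, so the two reads can anchor
    in the unique sequence flanking it."""
    return insert_size > repeat_len

# A >5000 bp plasmid insert can resolve a LINE repeat of ~5000 bp;
# cosmid (~40 kb) and BAC (~200 kb) inserts resolve longer duplications.
print(pair_jumps_repeat(insert_size=5500, repeat_len=5000))  # True
print(pair_jumps_repeat(insert_size=3000, repeat_len=5000))  # False
```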

4. Clone-based sequencing

Clone-based sequencing was used to sequence several genomes, including those of the yeast Saccharomyces cerevisiae (Oliver et al., 1992; Dujon, 1996), Caenorhabditis elegans (The C. elegans Sequencing Consortium, 1998), Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000), and humans (International Human Genome Sequencing Consortium, 2001). Most applications of clone-based sequencing were performed under the “map first, sequence second”, or physical mapping, approach. Physical mapping involves constructing a complete physical map of a large set of clones covering the genome with redundancy, and then selecting a minimal tiling subset of those clones to be fully sequenced. After sequencing and assembly, errors in the original map, gaps, and misassemblies can be spotted and, if necessary, corrected with additional targeted sequencing. This approach was successfully used to generate physical maps for sequencing the S. cerevisiae and C. elegans genomes (Coulson et al., 1986; Olson et al., 1986). The human genome was also sequenced by the public consortium using this approach, although the first draft large-scale assembly was still a challenging computational task: ordering and orienting the assembled contigs into a global genome assembly required integrating various sources of information, such as sequence overlap, mRNA and EST fragments, and end-sequencing of BACs. All these pieces of evidence were put together by a specialized system called GigAssembler (Kent and Haussler, 2001). Physical mapping is by no means the only possible way to perform clone-based sequencing. Other methods are possible but less explored, such as the walking approach (Venter et al., 1996; Batzoglou et al., 1999; Roach et al., 1999; Roach et al., 2000), or light shotgun sequencing of completely unmapped random clones, whereby a collection of clones covering the genome with redundancy C1 is sequenced to redundancy C2, for a total sequencing depth of C1 × C2. As clustering directly addresses the repeat challenge by effectively eliminating most repeats, assembly under any of the above schemes is, in principle, easier than under double-barreled WGS sequencing.

5. Whole-genome shotgun sequencing

Shotgun sequencing was initially applied by Sanger to the genome of bacteriophage λ (Sanger et al., 1980, 1982). Over the years, shotgun sequencing has been applied to genomic segments of increasing length. In the 1980s, it was applied to plasmids, chloroplasts (Ohyama et al., 1986), viruses (Goebel et al., 1990), and cosmid clones of length roughly 40 kb, which were considered to be the limit of this approach. In the 1990s, mitochondria (Oda et al., 1992), BACs of length 200 kb carrying genomic DNA from human or other organisms, as well as the entire 1800 kb genome of the bacterium H. influenzae (Fleischmann et al., 1995) were sequenced. Compared to clone-based shotgun sequencing, WGS poses a much harder computational challenge, especially in large, repeat-rich genomes. When a WGS strategy was proposed for sequencing a mammalian genome such as human (Weber and Myers, 1997), scientists debated whether such a strategy would succeed (Green, 1997). Today, we know that this is possible, as demonstrated by WGS assemblies of several complex genomes such as Drosophila melanogaster (Myers et al., 2000), human (Venter et al., 2001), and mouse (Mouse Genome Sequencing Consortium, 2002; Mural et al., 2002). These successes have brought computer science to the forefront of genomics: WGS assembly became possible through advances in algorithms and software systems rather than in either computer processor speed or sequencing accuracy.

6. Overview of WGS assembly

The prevailing methodology for WGS assembly is the three-stage overlap-layout-consensus strategy (Peltola et al., 1984; Kececioglu, 1991; Huang, 1992; Kececioglu and Myers, 1995). During the overlap phase, all read-to-read sequence overlaps are detected. During layout, subsets of the overlaps are selected in order to merge reads into longer scaffolds of ordered and oriented reads. Those scaffolds are converted into sequence through a multiple alignment during the consensus phase. One of the most successful systems employing this three-phase strategy is PHRAP (Green, 1994), which has been the standard for BAC-size assembly, but is unable to handle whole mammalian-size genomes. Current WGS assemblers include the Celera assembler (Myers et al., 2000), Euler (Pevzner et al., 2001), Arachne (Batzoglou et al., 2002; Jaffe et al., 2003), Phusion (Mullikin et al., 2003), Atlas (Havlak et al., 2004), and others. Most of these systems, at least implicitly, use the overlap-layout-consensus strategy.

In the following paragraphs, we will briefly describe each of the assembly phases, focusing on the challenges they pose and on the algorithms that current WGS assemblers employ to address these challenges. We will be using the following terminology: a contig is a multiple alignment of reads, which can be converted into contiguous genomic sequence during the consensus phase. A supercontig is a structure of ordered and oriented contigs, usually derived by linking contigs according to paired reads. Some papers refer to ultracontigs as higher structures of supercontigs, derived from various sources of linking information such as paired reads from large inserts, mapping of BAC clones, or even cross-species comparison. The general term for assembled structures is the scaffold, a term usually interchangeable with supercontig. As the layout phase is the most complex, a useful conceptual (and implementation) division of this phase is into two subphases: (1) contig layout, during which most assembly systems strive to identify repeat boundaries and connect the reads into contigs that are as long as possible without crossing those boundaries, as that would induce misassemblies; and (2) supercontig layout, perhaps the hardest phase of all, where contigs are chained into longer structures by use of paired reads crossing contig boundaries, and subsequently, gaps are filled and misassemblies are identified and remedied.
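To make the terminology concrete, here is one hypothetical way the hierarchy could be represented as Python data structures (a sketch of our own, not taken from any of the assemblers discussed):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Read:
    name: str
    seq: str
    mate: Optional[str] = None  # name of the paired-end mate, if any

@dataclass
class Contig:
    # A multiple alignment of reads: (read name, offset, reversed?)
    layout: List[Tuple[str, int, bool]] = field(default_factory=list)

@dataclass
class Supercontig:  # a.k.a. scaffold
    # Ordered and oriented contigs, with estimated gap sizes between
    # consecutive contigs (len(gaps) == len(contigs) - 1).
    contigs: List[Contig] = field(default_factory=list)
    gaps: List[int] = field(default_factory=list)
```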

7. The overlap phase

In the overlap phase, read-to-read sequence overlaps are detected. Since read orientation is not known, overlaps between reads in both forward and reverse-complement directions must be considered. The nature of the data requires several preprocessing tasks prior to overlap detection, such as trimming reads to remove bases with low quality scores and filtering out contamination from sequencing/cloning vector or mitochondrial sequences. Next, in the case of large-scale WGS, a significant algorithmic challenge is to find the overlaps efficiently. The strategies that have been employed rely on detecting exact word matches between the reads. Celera, Phusion, and Arachne all perform a global step of indexing all exact fixed-length word occurrences in the reads and then aligning all pairs of reads that share a word. Prior to alignment, filtering of very frequent words is necessary to avoid a huge number of pairwise read matches. Phusion and Arachne mask all highly repetitive k-mers, while Celera masks repeats, a step that is inherently more conservative and will discard additional spurious matches, but is also more complicated. Alignment is done with fast procedures that rely on the fact that only near-exact sequence identity between two reads can induce an overlap. Phusion's and Arachne's original indexing strategies were more time-efficient, but required more memory, than Celera's strategy. Arachne performs this step in several passes, trading speed for memory.
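A minimal sketch of this shared-word strategy is shown below; the word length k and the frequency cutoff are illustrative choices of ours, and a real system would also index reverse complements and verify each candidate pair with a banded alignment:

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads: dict, k: int = 24, max_freq: int = 50):
    """Index every k-long word in every read, drop words occurring more
    than max_freq times (likely repeats), and report read pairs sharing
    a surviving word as candidates for alignment."""
    index = defaultdict(set)          # word -> names of reads containing it
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    pairs = set()
    for names in index.values():
        if len(names) > max_freq:     # mask highly repetitive words
            continue
        pairs.update(combinations(sorted(names), 2))
    return pairs
```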

Upon detecting potential read overlaps, most assemblers filter them to remove the ones that are very likely to be repeat-induced. Arachne and Euler, in particular, use a sequencing error-correction strategy that greatly simplifies subsequent assembly. In Arachne, the main idea is that a sequencing error reveals itself as a singleton base aligned against several other reads that all agree on a different base, in contrast to a repeat polymorphism, which should be shared by several reads. Euler's error correction is based on finding and modifying low-frequency words in the reads, and is generally more aggressive, reducing the number of errors by as much as a factor of 50. After error correction, Arachne aggressively discards read-to-read overlaps that disagree in high-quality bases, because such overlaps are overwhelmingly likely to be repeat-induced. Most assemblers at this stage also discard likely chimeric reads, that is, reads formed by two distant segments of the genome being glued together. The signature of such reads is that they overlap other reads to the left and right, but no overlap spans the chimeric joint.
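A toy version of the low-frequency-word idea behind Euler-style error correction might look as follows (the k and min_freq values are our illustrative assumptions, and a real corrector would go on to edit the reads):

```python
from collections import Counter

def suspect_kmers(reads: list, k: int = 20, min_freq: int = 3) -> set:
    """At redundant coverage, a true k-long word should recur in about
    'coverage' reads; words seen fewer than min_freq times most likely
    contain a sequencing error and are flagged for correction."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {w for w, c in counts.items() if c < min_freq}
```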

8. Contig layout

The prevailing way of representing all read overlaps is the overlap graph (Myers, 1995), where reads are nodes and overlaps are labeled edges that record the location and (optionally) quality of each overlap. Only full overlaps between reads x and y are stored in the graph: partial overlaps that end somewhere within the two sequences are clearly repeat-induced and are not stored. Repeat boundaries can be detected in an overlap graph by finding a read x that extends to two reads y and z to the left (or right), where y and z do not overlap one another. The Celera assembler and Arachne use this overlap graph to merge reads into contigs up to the boundaries of repeats, thus joining all local interval subgraphs into contigs. To assemble contigs, Arachne employs a further heuristic: paired pairs of reads, defined as two plasmids of similar insert size with sequence overlaps occurring at both ends, are identified and merged, except under a heuristic condition that detects and avoids instances of long repeats. Paired pairs help to cross high-fidelity repeats slightly longer than a read length. Phusion instead clusters the reads and then passes the clusters to PHRAP for local assembly. One advantage of this approach is the use of PHRAP for low-level assembly, because PHRAP has been highly tuned to work well with real data. Most assemblers use several additional heuristics during contig layout in order to extend contigs as far as possible. For example, sequencing errors near the end of one read are identified as such by noticing that the read does not overlap any other read on that side; contig breaks due to such sequencing errors are then crossed. Also, contig breaks due to repeats that are shorter than one read length are resolved by finding reads that span the different repeat copies. More precisely, such a repeat can be resolved if each copy, except for at most one, is spanned by a read.
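The repeat-boundary test described above can be sketched directly on an adjacency map of rightward overlaps (a simplified illustration of ours, not Celera's or Arachne's actual code):

```python
from itertools import combinations

def at_repeat_boundary(x: str, right_of: dict) -> bool:
    """right_of[r] = set of reads that overlap r and extend it to the
    right. Read x sits at a repeat boundary if two of its rightward
    neighbours y and z do not overlap one another: they must then come
    from different repeat copies that diverge inside x."""
    for y, z in combinations(right_of.get(x, ()), 2):
        if z not in right_of.get(y, ()) and y not in right_of.get(z, ()):
            return True
    return False

# x overlaps y and z to the right, but y and z are unrelated: boundary.
print(at_repeat_boundary("x", {"x": {"y", "z"}, "y": set(), "z": set()}))
# True
```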

The resulting contigs constructed by Celera and Arachne are either unique, representing sequence that occurs once in the genome, or repetitive, representing a high-fidelity repeat sequence whose reads were merged together. In Arachne, the repeat has to exhibit nearly 100% fidelity. Contigs are classified as unique or repetitive by measuring the density of reads within a contig: repetitive contigs are likely to be at least twice as dense as unique contigs (Myers et al., 2000).
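A back-of-the-envelope version of this density test, in the spirit of the arrival-rate statistic of Myers et al. (2000), could look like this (names and the factor-of-two threshold are ours):

```python
def looks_repetitive(num_reads: int, contig_len: int,
                     total_reads: int, genome_len: int,
                     factor: float = 2.0) -> bool:
    """Under uniform sampling, a unique contig should contain about
    total_reads * contig_len / genome_len reads; a contig roughly
    twice as dense (or more) is likely a collapsed repeat."""
    expected = total_reads * contig_len / genome_len
    return num_reads >= factor * expected

# 400 reads in a 20 kb contig, out of 30M reads over a 3 Gb genome
# (expected ~200 reads), flags the contig as repetitive.
print(looks_repetitive(400, 20_000, 30_000_000, 3_000_000_000))  # True
```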

9. Supercontig layout

Once contigs are constructed and classified as unique or repetitive, the Celera and Arachne assemblers link them into larger structures, called scaffolds or supercontigs. One read-pair link can be an error, but two or more mutually confirming links that connect a pair of contigs are very unlikely to arise by chance and, in general, can be safely used to join the contigs into supercontigs. Most assemblers first join the unique contigs while excluding likely repeat ones, to produce supercontigs that represent correctly ordered and oriented unique sequence with gaps. Then, gaps are filled by recruiting the repeat contigs. This last stage of repeat resolution is rather complicated and involves a combination of several heuristic approaches. The Celera assembler fills in repeats in a series of increasingly aggressive steps, recruiting first the repetitive contigs that have two or more read-pair links to the flanking unique contigs, then those that have one link and some sequence overlap, and finally the rest, conditionally on some heuristics. Arachne attempts to walk through the gap between two unique contigs by a shortest-path search that requires anchors spaced at most every 2 kb, where an anchor is a repetitive contig with read-pair links to the flanking unique contigs. This heuristic, in practice, results in a reasonably accurate repeat reconstruction, except in the case of high-fidelity tandem repeats, which are often collapsed and show up as deletions with respect to the true sequence.
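The "two or more mutually confirming links" criterion is simple to sketch (a hypothetical illustration; real scaffolders also check that link distances and orientations are consistent):

```python
from collections import Counter

def confirmed_links(links: list, min_support: int = 2) -> set:
    """Keep only contig pairs connected by at least min_support
    mutually confirming read-pair links; single links may be chimeric
    or mistracked and are not trusted on their own."""
    counts = Counter(tuple(sorted(pair)) for pair in links)
    return {pair for pair, c in counts.items() if c >= min_support}

# Contigs A and B are linked by two read pairs, A and C by only one.
print(confirmed_links([("A", "B"), ("B", "A"), ("A", "C")]))
# {('A', 'B')}
```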

10. Derivation of consensus sequence

Once a layout has been obtained, deriving consensus sequence is a relatively straightforward task. Starting from the leftmost read of each contig, consensus sequence is derived from the multiple alignment of reads. Conservative quality scores are given to each consensus base according to the quality and agreement of bases in the multiple alignment column. The Celera assembler will additionally report positions that appear polymorphic and give an estimate of the likelihood that the polymorphism is real versus repeat-induced.
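As an illustration of quality-aware consensus calling, the sketch below picks a base for one multiple-alignment column and assigns it a conservative quality score (the exact scoring scheme is our simplification, not PHRAP's or Celera's):

```python
def consensus_base(column):
    """Given one multiple-alignment column as (base, quality) pairs,
    pick the base with the highest summed quality and assign it a
    conservative score: the winning total minus the total quality of
    all disagreeing bases, floored at zero."""
    totals = {}
    for base, qual in column:
        totals[base] = totals.get(base, 0) + qual
    best = max(totals, key=totals.get)
    disagreeing = sum(q for b, q in totals.items() if b != best)
    return best, max(totals[best] - disagreeing, 0)

# Three reads support 'A' (qualities 30, 25, 20); one read says 'G'.
print(consensus_base([("A", 30), ("A", 25), ("G", 10), ("A", 20)]))
# ('A', 65)
```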

11. Discussion on existing WGS assemblers

An elegant formulation of sequence reconstruction as an Euler Path Problem on a de Bruijn graph (Pevzner, 1989), originally proposed in the context of sequencing by hybridization, has led to the development of the Euler assembler (Pevzner et al., 2001). In that approach, a graph is constructed whose nodes are the k-long words in the reads and whose edges connect pairs of k-long words that share a (k-1)-long overlap. Read overlaps are thus implicitly found in an efficient manner similar to Arachne's and Phusion's approach. Error correction is done directly on the k-long words by modifying words that are too infrequent. Assembly then reduces to finding the correct Eulerian path, that is, a path that uses every edge exactly once; since there are usually a large number of Eulerian paths, a series of graph simplifications are applied on the basis of the reads and the read-pair links. This approach has the advantage of a clean formulation and demonstrably strong performance in resolving repeats, or at least identifying all assembly ambiguities due to repeats. One challenge is to develop an implementation that is efficient enough for whole mammalian genome assembly. The de Bruijn graph provides the same branching structure (repeat representation) as Myers's original overlap graph, but at k-mer rather than fragment resolution, yielding more precise repeat boundaries at the cost of extra memory usage.
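A minimal construction of the graph just described might look as follows (our sketch; error correction and the subsequent graph simplifications are omitted):

```python
from collections import defaultdict

def word_graph(reads, k):
    """Nodes are the k-long words occurring in the reads; consecutive
    words within a read share a (k-1)-long overlap and are joined by a
    directed edge. Assembly seeks an Eulerian path: a path using every
    edge exactly once."""
    graph = defaultdict(set)
    for seq in reads:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph

print(dict(word_graph(["ACGTAC"], k=3)))
# {'ACG': {'CGT'}, 'CGT': {'GTA'}, 'GTA': {'TAC'}}
```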

Several assemblers exist for large-scale application. The Celera assembler, Arachne, and Jazz effectively use the overlap-layout-consensus paradigm combined with read-pair information. Phusion follows a different route, motivated by the fact that PHRAP is a well-tried and effective tool for small-scale assembly: Phusion first clusters the reads efficiently and then sends several local assembly problems to PHRAP. The assemblies are then merged and corrected according to read-pair information. Arachne and Phusion were both used for assembling the mouse genome, with comparable performance (Jaffe et al., 2003; Mullikin et al., 2003). Atlas is a suite of programs for assembling a genome with an approach that combines clone-based and WGS data, and was used to assemble the rat genome (Rat Genome Sequencing Project Consortium, 2004).

12. Sequencing and assembly in the future

A few years ago, WGS assembly was considered impossible for large and repeat-rich genomes. Today, it is routinely applied, thanks to the development of computational methods led by the Celera assembly team and by public efforts. The success of WGS assembly contributed to the interest in bioinformatics methods within the biology community. Although assembly research pertinent to current sequencing technology seems to be reaching diminishing returns, additional open problems are coming from the development of new and revolutionary sequencing technologies that promise to deliver the “$1000 genome” within a decade (http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-003.html). Computationally, such technologies can be largely divided into two categories: (1) those that produce short, accurate reads cheaply, such as Pyrosequencing (Ronaghi et al., 1998; Ronaghi, 2001) or other technologies (Mitra and Church, 1999; Mitra et al., 2003; Braslavsky et al., 2003), and (2) those that produce very long reads, but potentially with very high error rates (see Pennisi, 2002). Short reads should be harder to assemble, even if mate-pair information is available. The reason is, again, effective repeat content: fewer repeats will be completely spanned by a single read. However, if reads are clustered without sacrificing the method's speed and low cost (e.g., by following a random clone-based sequencing approach that does not involve physical mapping), assembly should be possible. Long reads will present interesting alignment and large-scale indexing problems. In general, once the technological hurdles are overcome and a concrete computational problem is formulated, we can be confident that computational biologists will solve it and ultimately help make the $1000 genome a reality.
