
Errors in sequence assembly and corrections (Bioinformatics)

1. Introduction

The major source of the challenges in sequence assembly is the limitation of sequencing technologies, which today allow us to routinely sequence only about 500 to 800 bases of contiguous DNA. To overcome the limitation of short contiguous sequences, Frederick Sanger devised the shotgun sequencing technique and in 1982 demonstrated its potential by sequencing the genome of the bacteriophage lambda (Sanger et al., 1982). The process of shotgun sequencing begins with physically shearing the original DNA molecules into small pieces as randomly as possible. The fragments are subsequently inserted into cloning vectors and amplified by growing them in Escherichia coli. The ends of the inserts are sequenced, and the resulting reads are assembled by specialized computer software, shotgun fragment assembly programs (Havlak et al., 2004; Huang et al., 2003; Huson et al., 2001; Jaffe et al., 2003; Kent and Haussler, 2001; Mullikin and Ning, 2003; Pevzner et al., 2001; Pop et al., 2004; Tammi et al., 2003a; www.phrap.org).

Despite continuous technology improvements during the three decades after Sanger devised the dideoxy DNA sequencing method (Sanger et al., 1977), the average sequence length in routine production has only increased from about 200-300 contiguous bases to 500-800 bases. Although the average length has more than doubled, the increase is still too small to make a significant impact on the efficiency of fragment assembly. However, the throughput has increased enormously thanks to fluorescent detection of the base sequence (Smith et al., 1986) and automatic sequencers.

Although the sequence assembly problem may appear simple, it is known to be NP-complete. In addition, sequence assembly is complicated by a number of factors, but the three most important ones are sequencing or base-calling errors, genomic repeats, and polymorphism (see Article 11, Algorithms for sequence errors, Volume 7 and Article 2, Algorithmic challenges in mammalian whole-genome assembly, Volume 7).

2. Sequencing errors

Sequenced fragments contain base-calling errors, which can be incorrectly determined bases, insertions, and deletions. The number of errors tends to increase toward the ends of the reads. This is unfortunate, since most of the overlaps between reads are determined using the very ends of the sequenced fragments.
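
The effect of a base-calling error at a read end can be illustrated with a minimal sketch. It assumes overlaps are found by comparing a suffix of one read with a prefix of another and counting substitutions only (insertions and deletions are ignored). The function name `best_overlap` and its parameters are hypothetical and are not taken from any of the assembly programs cited here.

```python
def best_overlap(a, b, min_len=3, max_mismatch_rate=0.0):
    """Longest suffix of `a` matching a prefix of `b`.

    Returns (overlap_length, mismatches), or (0, 0) if no overlap of at
    least `min_len` bases stays within the allowed mismatch rate.
    Only substitutions are considered; indels would need an alignment.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * length:
            return length, mismatches
    return 0, 0


read_a = "ACGTTGCA"
read_b = "TGCAGGTC"        # true overlap with read_a is "TGCA"
read_b_err = "TGCTGGTC"    # same read with one miscalled base in the overlap

exact = best_overlap(read_a, read_b)
missed = best_overlap(read_a, read_b_err)
tolerant = best_overlap(read_a, read_b_err, max_mismatch_rate=0.25)
```

With exact matching, the single miscalled base hides the overlap entirely. Allowing mismatches recovers it, but the same tolerance also makes it easier to accept false overlaps between nearly identical repeat copies.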

3. Repeats

In shotgun sequencing, the original idea of putting the puzzle of sequences together by using sequence similarity often fails in the case of repeats. Genomic sequence contains many kinds of repeats of varying lengths. Repeat copies may be identical or almost identical, differing by only a few bases. They can be dispersed all over the genome and/or repeated in tandem, and they can occur in any number of copies. Repeated regions are difficult to separate and cause assembly programs to assemble fragments originating from different copies together, resulting in erroneous assemblies. The combination of sequencing errors and repeated sequences poses the greatest challenge in shotgun fragment assembly. It is probably close to impossible to correctly assemble identical repeat copies. It may be argued that no information is lost in that case; however, identical repeat copies may cause large artificial genomic rearrangements. Repeats that are shorter than the average read length cause fewer problems than longer copies.

4. Polymorphism

In whole-genome shotgun sequencing, polymorphism complicates the assembly of nearly identical repeats and in some cases also nonrepetitive regions, but it is not a problem in a “clone-by-clone” approach, since only one variant of each genomic region is sampled (see Article 12, Polymorphism and sequence assembly, Volume 7).

5. Incomplete coverage

The ratio NL/G represents the amount of oversampling of a genomic sequence of length G, where N is the number of sampled reads and L is the average length of the reads. This is also called coverage. Some assembly programs use statistical methods to search for the best overlaps and may easily assign false overlaps if no better overlap is present because of a lack of coverage. Some genomic regions are impossible to sequence for biological reasons. However, the output from an assembly program may also consist of several contigs because of an erroneous assembly or because of the nature of the sampling process itself, all resulting in gaps in coverage. The coverage has a certain probability of being zero, depending on the amount of sampling. When reads are sequenced from random fragments, not all genomic positions are equally sampled. The Lander and Waterman expressions (Lander and Waterman, 1988) can be used to calculate the average number of gaps and contigs, and their average length, in a shotgun project. If a target sequence G is redundantly sampled at an average coverage c, and assuming that the sheared fragments are uniformly distributed along the genomic sequence, the coverage at a given base b is a Poisson random variable with mean c:

$$P(\text{base } b \text{ is covered by } k \text{ fragments}) = \frac{e^{-c}\, c^{k}}{k!}, \qquad c = \frac{NL}{G}$$

For example, the fraction of G that is covered by at least one fragment is $1 - e^{-c}$. The Lander and Waterman expressions can also be used to compute the fraction of reads involved in overlaps detected by an assembly program, but which of the particular gaps are due to an erroneous assembly of reads is, of course, hard to estimate. However, some methods have used high coverage as an indicator that many repeat copies have been assembled together (Myers et al., 2000).
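
The coverage statistics above can be worked through numerically. The following is a minimal sketch under the Poisson model described in the text. The function names are hypothetical, the project size is an invented example, and `expected_contigs` uses the simplified form N·e^(-c), which assumes that any overlap, however short, can be detected (the full Lander and Waterman expression also accounts for a minimum detectable overlap).

```python
import math


def coverage(num_reads, read_len, genome_len):
    """Average coverage c = N * L / G."""
    return num_reads * read_len / genome_len


def prob_depth(k, c):
    """Poisson probability that a given base is covered by exactly k reads."""
    return math.exp(-c) * c ** k / math.factorial(k)


def fraction_covered(c):
    """Expected fraction of the target covered by at least one read."""
    return 1.0 - math.exp(-c)


def expected_contigs(num_reads, c):
    """Simplified expected number of contigs, assuming every overlap is detectable."""
    return num_reads * math.exp(-c)


# Illustrative project: 10,000 reads of 500 bases from a 1 Mb target.
c = coverage(10_000, 500, 1_000_000)
uncovered = prob_depth(0, c)
covered = fraction_covered(c)
contigs = expected_contigs(10_000, c)
print(f"c = {c:.1f}, covered = {covered:.4f}, expected contigs = {contigs:.1f}")
```

Even at fivefold coverage, a little under 1% of the bases are expected to remain unsampled in this model, which is why gaps appear even when the assembly itself is error free.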

An additional complicating factor is the unknown orientation of the shotgun fragments. It is not known from which DNA strand each sequenced fragment originates, and this increases the complexity of the assembly task. A read may therefore match one strand as sequenced, or its reverse complement may match the other strand. In a sequencing project containing one million sequenced reads, a complementary set of reads must be generated, resulting in two million reads. The complete set is required for the evaluation of all overlaps. The unknown orientation, together with possible contamination and unremoved vector sequences, may cause severe errors in assemblies.
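
Generating the complementary read set described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes reads are plain strings over A, C, G, T, and N.

```python
# Translation table for complementing bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_both_orientations(reads):
    """Return each read followed by its reverse complement (2 * N sequences)."""
    both = []
    for read in reads:
        both.append(read)
        both.append(reverse_complement(read))
    return both


reads = ["ACGTT", "GGA"]
oriented = with_both_orientations(reads)
```

Doubling the read set in this way lets an overlap search treat every read as if its strand were known, at the cost of twice as many candidate sequences to compare.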

A number of methods have been invented in order to address these problems. One powerful way is the assignment of error probabilities to base-calls (Ewing and Green, 1998). The knowledge of the quality of the base-calls allows efficient trimming of the reads and statistical evaluation of candidate overlaps (www.phrap.org). A detailed analysis of highly similar repeat copies may be performed (Tammi et al., 2002) as well as error correction of the sequenced reads (Tammi et al., 2003b; Pevzner et al., 2001). Since clone inserts are usually sequenced from both ends and the insert lengths are known, this information can be used to position the read pairs within the assembly (Edwards and Caskey, 1990; Myers et al., 2000). It is likely that the combination of the information given by sequenced pairs and statistical methods for analysis of highly similar repeats yields the most powerful approach in the struggle against misassemblies caused by repeats and sequencing errors. It is, however, clear that current sequence assembly software is only capable of producing a draft sequence, and much labor-intensive manual finishing is required to arrive at a good quality, reliable, finished sequence.
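
How base-call error probabilities support read trimming can be sketched in a few lines. The example assumes quality values on the widely used Phred scale, where Q = -10·log10(p) for an error probability p. The end-trimming rule and the threshold of 20 are illustrative assumptions only and are not the procedure used by any particular program cited above.

```python
def error_probability(q):
    """Convert a Phred-scaled quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def trim_by_quality(seq, quals, min_q=20):
    """Remove bases from both ends until a base with quality >= min_q is reached.

    A deliberately simple rule: low-quality bases inside the read are kept.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


read = "ACGTACGT"
quals = [5, 10, 30, 35, 40, 30, 12, 8]   # quality typically drops toward the ends
trimmed_seq, trimmed_quals = trim_by_quality(read, quals)
```

A quality value of 20 corresponds to an error probability of 1 in 100. Trimming the unreliable ends reduces the chance that an overlap is accepted or rejected on the basis of miscalled bases.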

The human genome was sequenced both by a public (Lander et al., 2001) and a private initiative (Venter et al., 2001). The public initiative used the hierarchical or “clone-by-clone” approach, a method that involves extensive mapping. The mapping step was avoided by the whole-genome shotgun (WGS) approach, which was used by Celera in the privately funded effort. A comparison of these approaches by She et al. (2004) showed that the WGS approach runs into problems on the repeated parts of the genome. Apart from the highly repetitive telomere and centromere regions, which were not targeted by either initiative, the WGS approach was unable to adequately resolve repeats larger than 15 kb where the difference between copies was 3% or less. About 4% of the sequence was lost or erroneously assembled because of collapsed repeated regions. This leads to a significant reduction of the actual genome length and the loss of many biologically important regions, including genes. It is likely that a combination of the sequencing strategies is the most advantageous one.
