Microbial sequence assembly (Bioinformatics)

The first microbial genome projects began in the 1990s and focused either on important model systems (e.g., Escherichia coli, Saccharomyces cerevisiae) or on important pathogens (e.g., Mycobacterium tuberculosis). The prevailing view in these early projects was that a complete genome sequence had to be assembled piecemeal from large-insert clones, such as cosmids, that had been ordered into a tiling path along the chromosome (see Article 8, Genome maps and their use in sequence assembly, Volume 7). Each large-insert clone would be shotgun sequenced and its sequence assembled on the basis of overlaps between the individual random shotgun reads. The assembled clone sequences would then be stitched together to form the complete genome.

The alternative to this clone-by-clone (CBC) approach was whole-genome shotgun (WGS) assembly, in which the entire genome is fragmented and cloned and the ends of each clone are sequenced to produce a large number of random sequence reads. These reads are then assembled en masse to produce the final assembly. In the early years of microbial genome projects, however, the CBC approach was broadly adopted out of concern that repeated sequences in the genome would pose an insurmountable problem for the WGS approach: repeated sequences cannot be placed unambiguously on the basis of sequence overlaps alone, and would therefore create gaps in the genome assembly.


This concern about repeated sequences was dispelled in 1995, when the Haemophilus influenzae genome was assembled using the WGS method at The Institute for Genomic Research (Fleischmann et al., 1995). This was the first large-scale (for microbes) WGS project to be completed, and it set the stage for virtually every subsequent microbial genome project. The software developed for this project, the TIGR Assembler (Pop and Kosack, 2004), introduced many ideas that are still used in latter-day genome assembly programs, even though assembly software has grown in sophistication and now handles genomes more than 1000 times the size of a bacterial genome.

The general approach in most WGS genome projects today is first to create a series of shotgun genomic libraries with different insert sizes (usually small, 3 kb; medium, 10 kb; and large, 40 kb). Both ends of each clone are sequenced until the total number of bases produced is about eight times the number of bases in the genome, that is, roughly eightfold coverage. The reads are then assembled on the basis of sequence overlaps to produce contigs: blocks of contiguous sequence, each representing a path of overlapping reads. A contig ends where no further overlapping read can be added, which leaves a gap. Gaps may be due to regions that are missing from the clone libraries, or to regions containing repeated sequences that cannot be placed unambiguously in the genome and so are skipped by the assembly software.
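The coverage arithmetic and the overlap-based construction of contigs can be illustrated with a toy sketch. It is not the TIGR Assembler or any production algorithm: it assumes error-free reads from a single strand, exact suffix-to-prefix matches, and no read wholly contained in another. The function names, the 500-base read length, and the 2 Mb genome size are hypothetical values chosen for illustration.

```python
import math


def reads_for_coverage(genome_size, read_len, coverage=8):
    """Number of reads whose combined length is `coverage` times the genome size."""
    return math.ceil(coverage * genome_size / read_len)


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such match of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases; whatever
    sequences remain are the contigs (toy simplification, see lead-in).
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Eightfold coverage of a hypothetical 2 Mb genome with 500-base reads.
print(reads_for_coverage(2_000_000, 500))

# Three overlapping reads and one unrelated read.
print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC", "CCCCAAAA"]))
```

In this example the three overlapping reads collapse into one contig, while the unrelated read is left as a separate contig. The break between the two corresponds to a gap in the sense described above.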

The contigs that result from assembling the WGS reads are ordered and oriented with respect to each other using read-pair (sometimes called mate-pair) information. Because each clone in the shotgun libraries is sequenced from both ends, it yields a pair of reads whose separation in the genome is approximately equal to the insert size of the clone. When the two reads of a pair fall in different contigs, this information is used to link those contigs together. A set of contigs linked by read pairs is often called a scaffold. Thus, most assembly algorithms deal with the automated construction of contigs and scaffolds.
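The linking step can be sketched as a grouping problem: contigs are joined into the same scaffold whenever enough read pairs connect them. The sketch below is a simplification under stated assumptions. It only groups contigs and does not order or orient them or check insert sizes, and the function name, input layout, and `min_links` threshold are hypothetical rather than taken from any particular assembler.

```python
from collections import Counter


def build_scaffolds(read_to_contig, mate_pairs, min_links=2):
    """Group contigs into scaffolds using read-pair links.

    read_to_contig: dict mapping a read name to the contig that contains it
    mate_pairs:     iterable of (read, read) tuples from the same clone
    min_links:      minimum number of read pairs needed to trust a link
                    (hypothetical threshold for illustration)
    """
    # Count read pairs whose two reads landed in different contigs.
    links = Counter()
    for r1, r2 in mate_pairs:
        c1, c2 = read_to_contig.get(r1), read_to_contig.get(r2)
        if c1 is None or c2 is None or c1 == c2:
            continue  # unplaced read, or both reads in the same contig
        links[frozenset((c1, c2))] += 1

    # Union-find over contigs: each trusted link merges two groups.
    parent = {c: c for c in set(read_to_contig.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for pair, count in links.items():
        if count >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


placements = {"a1": "c1", "a2": "c2", "b1": "c1", "b2": "c2",
              "d1": "c2", "d2": "c3", "e1": "c4", "e2": "c4"}
pairs = [("a1", "a2"), ("b1", "b2"), ("d1", "d2"), ("e1", "e2")]
print(build_scaffolds(placements, pairs))               # c1 and c2 share two links
print(build_scaffolds(placements, pairs, min_links=1))  # a single link now suffices
```

Requiring more than one supporting read pair is a common-sense guard against chimeric clones or misplaced reads; the right threshold depends on the library and is left as a parameter here.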

At this stage of the assembly, two kinds of gaps remain to be filled: gaps between contigs within a scaffold (captured gaps, since clones exist that span them) and gaps between scaffolds (uncaptured gaps). In the former case, the polymerase chain reaction (PCR) is used to generate a product spanning the gap, since primers can be designed and properly oriented on the basis of the sequence of the contigs flanking the gap. The latter case poses the greater challenge, since one does not know which contigs flank an uncaptured gap. The principal solution for sequencing these regions is primer walking, in which a primer is designed from the end of a contig adjacent to a gap and used to sequence from genomic DNA into the gap region. The new sequence either connects the contig to another contig or scaffold or, if it does not, serves as the target for the design of a new primer and another walk. The process is repeated until the gap is spanned.
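The iterative nature of primer walking can be made concrete with a small simulation. This is only a sketch under strong assumptions: a known `genome` string stands in for the genomic DNA template, walking proceeds on one strand from the right-hand end of a contig, each contig occurs once in the genome, and a walk is judged to have reached another contig when the new read contains that contig's first `anchor` bases. All names and parameter values are hypothetical.

```python
def primer_walk(genome, contigs, start_idx, read_len=20, primer_back=5,
                anchor=8, max_walks=100):
    """Walk rightward from contigs[start_idx] until another contig is reached.

    Returns (index of the contig reached, number of walks), or
    (None, number of walks) if the template ends or max_walks is exhausted.
    """
    if read_len <= primer_back:
        raise ValueError("each read must extend past its primer site")

    start = contigs[start_idx]
    known_end = genome.index(start) + len(start)  # end of known sequence
    walks = 0
    while walks < max_walks:
        walks += 1
        # Place the primer a little inside the known sequence, then read on.
        primer_site = known_end - primer_back
        new_seq = genome[primer_site:primer_site + read_len]
        known_end = primer_site + len(new_seq)

        # Does the new sequence run into the start of another contig?
        for idx, contig in enumerate(contigs):
            if idx != start_idx and contig[:anchor] in new_seq:
                return idx, walks

        if len(new_seq) <= primer_back:
            break  # no new bases obtained: end of the template
    return None, walks


contig_a = "ATGGCGTACGTTAGC"
contig_b = "GGATCCGATCGAATT"
gap = "CT" * 15  # 30 unsequenced bases between the two contigs
genome = contig_a + gap + contig_b

print(primer_walk(genome, [contig_a, contig_b], start_idx=0))
```

Each walk adds at most `read_len - primer_back` new bases, so the number of walks grows with the length of the gap, which is one reason this stage is slow compared with automated assembly.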

The gap-filling steps just described tend to be more labor-intensive and costly than the WGS sequencing and automated assembly stages. In fact, the major advantage of the WGS approach is that it requires no up-front effort to map large-insert clones or develop a tiling path. As a result, a WGS genome project may stop after automated assembly and before the gap-filling stage. The product of such a project is called a “draft” genome sequence. Within the contigs, the sequence is generally of high quality and sufficient for gene prediction or microarray design, which are often the main deliverables of a genome project. The gaps that remain may be due principally to repeated sequences, which are often not critical for understanding the phenotype of an organism. Draft sequencing is particularly popular for genomes where gap filling is difficult because of the large size of the genome or other properties. However, a “finished” genome sequence, one with gaps filled and other defects removed, is still superior for detailed analysis of genes and regulatory sequences, where single-nucleotide differences can be significant.
