Genome sequencing of microbial species (Genomics)

1. Whole-genome shotgun cloning: a revolution in the microbial field

Microbes account for most of life on earth and are critical to its ecological balance. However, researchers have only scratched the surface of the tremendous biodiversity that these organisms display. Less than 1% of these organisms has been cultured, which is only a minute proportion of the microbial diversity present in the environment. If this diversity is an indicator of the physiological, metabolic, and adaptive abilities of the uncultured microorganisms, one can barely begin to imagine what remains to be discovered among the microbes on earth. No other field of research has embraced and applied genomic technology more than the field of microbiology, and genomic science has provided information that cannot be obtained by any other means. Microbial genomics has a broad range of applications, from understanding basic biological processes, host-pathogen interactions, and protein-protein interactions to discovering DNA variations that can be used in genotyping or forensic analyses. In addition, genomic data are being applied to unravel gene expression patterns through the development and analysis of DNA microarray data.

In 1995, The Institute for Genomic Research, led by J. Craig Venter, sparked a revolution in genomics by using whole-genome shotgun sequencing (Fleischmann et al., 1995) (Figure 1) to determine the first complete genome sequence of a free-living organism, the bacterium Haemophilus influenzae. Since that first report, more than 220 microbial genomes have been sequenced and at least another 650 are in progress (February 2005; http://www.genomesonline.org/) (Figure 2). This global effort has focused primarily on pathogens, which to date account for the majority of all genome projects (Figure 2), and has generated a large amount of raw material for in silico analysis. Additionally, in recent years, multiple strains of the same species, or multiple species of the same genus, have been the targets of sequencing projects, opening the possibility of comparing closely related genomes (see Article 61, Comparative genomics of the ε-proteobacterial human pathogens Helicobacter pylori and Campylobacter jejuni, Volume 4). This will improve our understanding of microbial biology, pathogenicity, and evolution. However, the major challenge in the postgenomic era is to fully exploit and decipher this accumulating wealth of information.



Figure 1 The steps involved in the whole-genome shotgun sequencing procedure. (a) Library construction. Total genomic DNA is extracted and mechanically sheared into smaller fragments. Each fragment is ligated into a cloning vector. (b) Random sequencing. About 6000 random clones per megabase pair are sequenced from both ends of the insert to achieve 8X coverage. (c) The small sequences (~800 bp) are assembled into larger contigs using computational algorithms, such as the Celera Assembler. (d) The contigs are linked to each other during the closure phase, where the sequence is also manually edited. (e) Annotation. Using programs such as Glimmer, open reading frames (ORFs) are marked. The predicted protein sequences from these putative open reading frames are searched against nonredundant protein databases. (f) A complete genome is obtained after manual curation of the annotation.
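
The coverage figure quoted in the caption of Figure 1 can be checked with simple arithmetic. The sketch below uses the standard relation c = N x L / G (reads times read length, divided by genome size) and the Poisson estimate e^(-c) for the fraction of bases that no read covers. The function names are illustrative, and the reading that the gap between the raw value and the quoted 8X reflects failed or trimmed reads is an assumption, not something the caption states.

```python
import math


def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average fold coverage c = N * L / G."""
    return n_reads * read_length_bp / genome_size_bp


def fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


# Numbers taken from the caption: 6000 clones per megabase, each clone
# sequenced from both ends, reads of roughly 800 bp.
reads_per_mb = 6000 * 2
raw = fold_coverage(reads_per_mb, 800, 1_000_000)

# Assumption: the difference between the raw value and the quoted 8X is
# due to failed reads and trimmed low-quality read ends.
usable_fraction = 8 / raw

print(f"raw coverage: {raw:.1f}X")
print(f"usable fraction implied by 8X: {usable_fraction:.2f}")
print(f"expected uncovered fraction at 8X: {fraction_uncovered(8):.5f}")
```

Under this idealized model, only a few hundred bases per megabase would remain unsequenced at 8X. Real projects leave more gaps than that, because cloning and sequencing are not perfectly random, which is one reason a separate closure phase is needed.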


Figure 2 Completed genome sequencing project timeline

The whole-genome shotgun sequencing strategy does not require an initial mapping step to create a set of overlapping clones. Instead, it relies on computational methods, such as the TIGR Assembler (Sutton et al., 1995), the Celera Assembler (Myers et al., 2000), and Phrap (http://www.phrap.org), to correctly assemble tens of thousands of random DNA sequences 300 to 900 bp long. The algorithms underlying the assembly software have also been shown to be powerful enough to assemble larger eukaryotic genomes, including the human genome (Venter et al., 2001; see also Article 25, Genome assembly, Volume 3). Given the current state of sequencing technologies, whole-genome shotgun sequencing remains the industry standard.
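
The idea of assembling random reads from their overlaps alone can be illustrated with a toy greedy merger. This is a minimal sketch and not the algorithm used by the TIGR Assembler, the Celera Assembler, or Phrap: it ignores sequencing errors, reverse-complement reads, repeats, and mate-pair information, all of which real assemblers must handle. The function names and the example sequence are made up for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical 22-bp "genome" and three overlapping reads, given out of order.
genome = "ATGGCGTACGTTAGCCTAGGAT"
reads = [genome[12:22], genome[0:10], genome[6:16]]
print(greedy_assemble(reads))
```

If the reads do not overlap sufficiently, the function returns several contigs rather than one, which mirrors the situation that the closure phase is meant to resolve.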

2. Genome annotation

The first step in the analysis of a completed and fully assembled genome is to determine the precise location of all the protein-coding regions and to assign a putative function to each, through a process known as annotation (see Article 29, In silico approaches to functional analysis of proteins, Volume 7). The wide variety of bioinformatics methods developed to analyze sequence data has made annotation an increasingly sophisticated process. Computational gene finders (see Article 13, Prokaryotic gene identification in silico, Volume 7) using interpolated Markov modeling algorithms, such as Glimmer (Delcher et al., 1999), are routinely capable of finding more than 99% of all genes in a microbial genome. The predicted protein sequences from these putative open reading frames (ORFs) are searched against nonredundant protein databases and well-curated protein families, such as the PFAM (Bateman et al., 2002) and TIGRFAM (Haft et al., 2003) collections, that have been created using hidden Markov models (HMMs). HMMs are powerful statistical representations of groups of proteins that share sequence, and consequently, functional similarity. HMMs can represent very specific enzymatic functions or a superfamily of related functions. The use of HMMs has helped refine the annotation process. In addition, searches for PROSITE motifs (Sigrist et al., 2002), lipoproteins, signal peptides, and membrane-spanning regions are performed. On the basis of the evidence gathered, a two-stage annotation protocol is carried out whereby an initial automated annotation is followed by manual curation of each gene assignment by an expert biologist to ensure accuracy and consistency of the putative function of each predicted coding region. Proteins whose specific function cannot be confidently determined are designated “putative” or given a less specific family name. Proteins without any significant matches in any of the searches performed are annotated as hypothetical proteins.

Consistent description and annotation of genes in different databases is critical to facilitate uniform queries across independent databases. This problem is being addressed by the development of controlled vocabularies (ontologies), such as the Gene Ontology (GO) project (The Gene Ontology Consortium, 2004; see also Article 82, The Gene Ontology project, Volume 8), where gene products are described in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner.
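
To make the gene-finding step described above concrete, the sketch below scans all six reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a deliberately naive stand-in: Glimmer scores candidate ORFs with interpolated Markov models rather than simply listing them, and it also considers alternative start codons. The function names, the `min_codons` threshold, and the example sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """List naive ORFs as (strand, start, end, sequence).

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (for "-" that is the reverse-complemented string).
    The codon count includes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (pos + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, pos + 3, s[start:pos + 3]))
                    start = None  # look for the next ATG in this frame
    return orfs


# Hypothetical sequence containing a single short ORF on the forward strand.
for orf in find_orfs("CCATGAAATTTGGGTAACC"):
    print(orf)
```

In a real pipeline, each predicted ORF would then be translated and searched against protein databases and HMM collections, as the text describes.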

3. What have we learned so far?

High-throughput genome sequencing technologies have only been around for less than 10 years, but the impact of these technologies has been profound. Genome sequence data have been obtained from representative species of all three domains of life (Figure 2); however, because of their relatively small size, bacterial and archaeal genomes have dominated the field (Figure 2). Taken together, comparative genome analysis has revealed interesting patterns pertaining to microbial species; for example, gene density in microbes is very consistent with about one gene per kilobase of DNA. Although we are able to identify microbial genes with a high degree of success, we cannot assign a function to about a quarter of all the ORFs in each species sequenced so far. This observation demonstrates how little is known about the biology and biochemistry of microbial species, and supports the idea of an incredible microbial diversity. These sets of genes that encode hypothetical proteins represent exciting opportunities for the research community and are not only potential sources of biological resources to be explored for future use, but also clearly indicate the need for further extensive genetic, enzymatic, and physiological analyses, before genomic data can be fully exploited.

Analysis of more than 150 microbial genome sequences has revealed an unexpected diversity and variability in genome size and structure, even in species previously thought to be identical. Many microbes possess diverse chromosome architectures that are quite different from the classical single circular chromosome. For example, the genome sequence of the human pathogen Vibrio cholerae unexpectedly revealed the presence of two circular chromosomes (Heidelberg et al., 2000), whereas the genome of Borrelia burgdorferi (Casjens et al., 2000; Fraser et al., 1997), the causative agent of Lyme disease, contained a relatively small (910 kb) linear chromosome and an unprecedented 21 linear and circular plasmids. On the other hand, the Streptomyces coelicolor linear chromosome is more than 9 Mb long (Bentley et al., 2002). In addition to differences in genome structure, microbial genomes vary widely in their GC content, which ranges from 24% to more than 70%. The effect of this disparity in GC content is reflected in the wide range of codon usage and amino acid composition of proteins among various species.
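
The link between GC content and codon usage can be seen in a small example. The sketch below computes GC content and codon counts for two made-up coding sequences that encode the same four-residue peptide (Met-Lys-Ala-Leu) with different synonymous codons. The sequences and function names are illustrative only and are not taken from any sequenced genome.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Count codons in a coding sequence, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Two hypothetical genes encoding the same peptide with different codons.
low_gc_gene = "ATGAAAGCATTATAA"   # AAA (Lys), GCA (Ala), TTA (Leu)
high_gc_gene = "ATGAAGGCGCTGTGA"  # AAG (Lys), GCG (Ala), CTG (Leu)

for name, gene in (("low GC", low_gc_gene), ("high GC", high_gc_gene)):
    print(name, round(gc_content(gene), 2), dict(codon_usage(gene)))
```

The same protein can therefore be encoded at quite different GC levels, and a genome-wide bias toward one end of the range shifts which synonymous codons, and to some extent which amino acids, are favored.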

As noted earlier, the study of bacterial pathogens has dominated and influenced the microbial genomic arena. This has resulted from the potential for developing a better understanding of virulence as well as identifying putative targets for vaccine and antimicrobial drugs. Access to the genomes of a variety of pathogens has allowed scientists to broaden their knowledge of pathogenicity through comparative genome analysis.

Organisms that belong to the same genus can differ in gene content by as much as 25%, as was found when the genome of Escherichia coli K-12 was compared to that of E. coli O157:H7 (Hayashi et al., 2001; see also Article 51, Genomics of enterobacteriaceae, Volume 4). Insertion and deletion events appear to have played a major role and account for most of the differences observed. Pathogenicity islands, which are large blocks of self-mobile DNA that carry genes enabling an organism to act as a pathogen, have the ability to transfer from one organism and integrate into a new host. Other pathogens show little variation in chromosomal gene content, as demonstrated by the comparison of the genomes of two isolates of Yersinia pestis (Deng et al., 2002; Parkhill et al., 2001), the etiologic agent of plague (see Article 58, Yersinia, Volume 4). Remarkable differences in chromosome structure, dominated by genome rearrangements, accounted for most of the variation observed between these two closely related strains. The differences appear to result from multiple inversions of genome segments at insertion sequences. Y. pestis strains carry most of their virulence determinants on plasmids, which are absent in their ancestor, Yersinia pseudotuberculosis. A remarkable number of pseudogenes (degenerated and inactive genes) have been found in the genomes of Y. pestis, an indication of a recent and still evolving genome.
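
A gene-content comparison of this kind reduces, at its simplest, to set operations on lists of genes that have already been matched between genomes. The sketch below assumes that orthologous genes share an identifier, which in practice requires a separate sequence-comparison step that is not shown here. The gene identifiers and the function name are hypothetical.

```python
def compare_gene_content(genes_a, genes_b):
    """Split two gene sets into shared and strain-specific parts."""
    a, b = set(genes_a), set(genes_b)
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "fraction_a_specific": len(a - b) / len(a),
        "fraction_b_specific": len(b - a) / len(b),
    }


# Hypothetical gene identifiers for two strains of the same species.
strain_a = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
strain_b = {"g1", "g2", "g3", "g4", "g5", "g6", "h1", "h2"}

result = compare_gene_content(strain_a, strain_b)
print(sorted(result["only_a"]), sorted(result["only_b"]))
print(result["fraction_a_specific"], result["fraction_b_specific"])
```

Strain-specific genes that cluster together on the chromosome are natural candidates for insertions such as pathogenicity islands, although confirming that requires looking at their genomic context.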

Often, differences between a pathogen and a nonpathogen cannot be explained solely by looking at gene presence or absence, but by subtle single nucleotide changes. These changes can have disproportionately large consequences. Important virulence genes have been shown to be completely inactivated by such changes. Virulence or survival can also be modulated by hypervariable short homopolymeric sequences, which vary in size during replication, and can result in frameshifts and inactivation or activation of important virulence genes, as seen in the human pathogens Helicobacter pylori and Campylobacter jejuni (Parkhill et al., 2000; see also Article 61, Comparative genomics of the ε-proteobacterial human pathogens Helicobacter pylori and Campylobacter jejuni, Volume 4).
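
The effect of a variable homopolymeric tract on a reading frame can be demonstrated with a short simulation. The sketch below builds a made-up gene containing a poly-G tract of adjustable length and translates it with the standard genetic code. The gene sequence, tract lengths, and function names are invented for illustration and do not correspond to any actual H. pylori or C. jejuni gene.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate from the first base until the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def gene_with_tract(n_g):
    """Hypothetical gene with a poly-G tract of n_g bases after two codons."""
    return "ATGAAA" + "G" * n_g + "CTGACCTATGAATAA"


# One base gained or lost in the tract shifts the downstream reading frame.
for n in (8, 9, 10):
    print(n, translate(gene_with_tract(n)))
```

With nine G residues the downstream codons are read in frame. Losing or gaining a single G shifts the frame, changes the downstream residues, and brings a premature stop codon into frame, which is how slippage in such tracts can switch a gene off and, by the reverse change, on again.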

Genomic information can also be used to design novel vaccines and drugs. In a pioneering study, Pizza et al. (2000) exploited the genome sequence of Neisseria meningitidis to identify two highly conserved vaccine candidates within a set of cell-surface expressed or secreted proteins. There is no doubt that genomics has contributed enormously to a better understanding of bacterial pathogenicity; however, one genome is not enough. Much is still unknown, and comparative genomics of close relatives of both pathogens and nonpathogens will be critical to unravel the secrets of microbial pathogenicity and continue the search for better and innovative vaccines or drugs.

The initial focus on pathogenic microbial species has shifted to include non-pathogenic environmental microbes. Understanding and accessing the tremendous microbial biochemical diversity that exists in the environment could have an important impact on industrial processes and help in resolving environmental issues, such as the bioremediation of human pollution.

Many archaea are considered extremophiles, as they often thrive under “extreme” conditions, such as high or low temperatures, high pressures, or high salt concentrations, among others. The novel enzymes encoded in these genomes (Figure 2) offer clear potential for biotechnological applications. In addition, genome analysis of the hyperthermophilic bacterium Thermotoga maritima (Nelson et al., 1999) revealed that 20-25% of the genes in this species were more similar to genes from archaea than from bacteria, leading to a renewed interest in the process of lateral gene transfer and the role that it plays in microbial evolution and diversity (see Article 66, Methods for detecting horizontal transfer of genes, Volume 4).

Among the bacteria, the genome sequences of Deinococcus radiodurans (White et al., 1999), the most radiation-resistant organism on earth, and Geobacter sulfurreducens (Methe et al., 2003), which can clean up uranium and organic waste contamination, will allow scientists to develop and optimize practical applications, such as the bioremediation of radioactive metals and harvesting electricity from waste organic matter. The genome of Photorhabdus luminescens, an insect pathogen living in symbiosis with a nematode, has been fully sequenced (Duchaud et al., 2003). The analysis uncovered a variety of genes coding for entomopathogenic toxins, potentially useful in the fight against insect pests. Moreover, P. luminescens carries a large number of genes coding for the biosynthesis of antibiotics and fungicides, which could have potential applications for the treatment of infectious diseases. The genomes of Streptomyces coelicolor (Bentley et al., 2002) and Streptomyces avermitilis (Ikeda et al., 2003; Omura et al., 2001), both known to produce a wide variety of natural products, will assist in genome engineering to make novel and more efficient antimicrobial agents.

Researchers have only scratched the surface of microbial biodiversity. In order to harvest this enormous potential, genome shotgun sequencing is being applied to the environment. In a landmark study, the microbial populations from water samples collected in the Sargasso Sea were sequenced (Venter et al., 2004). An estimated 1.2 million new genes were identified from at least 1800 genomic species. Similar techniques were applied to a community of microbes from a biofilm growing at pH 0.83 on the surface of acid mine drainage (Tyson et al., 2004). In this study, the low-diversity genomic community was entirely reconstructed, and the subsequent examination of the metabolic capabilities of this community gave valuable information on how each organism participates in the ecology of the biofilm. These types of microbial studies will help us define the entire repertoire of organisms in specialized niches and ultimately the mechanisms by which they interact in the biosphere.

With the technical advances in genome sequencing and analysis, genomics has also found an application in the field of microbial forensics. After the bioterror events of October 2001, in which letters containing spores of Bacillus anthracis, the causative agent of anthrax, were sent through the mail, the genome of the B. anthracis isolate responsible for the death of a Florida man was rapidly sequenced, and single nucleotide polymorphisms were found that could help identify the origin of the samples used in this attack (Read et al., 2002).
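
At its core, finding single nucleotide polymorphisms between an isolate and a reference means comparing aligned positions. The sketch below assumes that the two sequences are already aligned and of equal length, with no insertions or deletions. A real analysis would start from a whole-genome alignment and would filter out sequencing errors. The sequences and the function name are made up for illustration.

```python
def find_snps(reference, isolate):
    """Return (position, reference_base, isolate_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    have the same length.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, iso_base)
        for pos, (ref_base, iso_base) in enumerate(zip(reference, isolate), start=1)
        if ref_base != iso_base
    ]


# Hypothetical aligned fragments from a reference strain and an isolate.
reference = "ACGTACGT"
isolate = "ACGTTCGA"
for pos, ref_base, iso_base in find_snps(reference, isolate):
    print(f"position {pos}: {ref_base} -> {iso_base}")
```

A set of such positions that distinguishes one isolate from its close relatives can then serve as a genotyping signature for tracing related samples.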

4. Conclusions

Scientists in a number of different fields have employed the tools of genomics, but no field has embraced and applied these technologies as quickly and effectively as the field of microbiology. Genomics will continue to improve the quality of human life well into the future as scientists continue to unravel the enormous amount of data that is being accumulated. More genome sequences are needed, new annotation tools must be developed and applied, and the databases that archive genomic data must be improved for better cross communication and up-to-date data. There is no question that genome-sequencing technologies are rapidly improving and that the data are going to accumulate at a faster pace in the future. The genomics community needs to be prepared to analyze and make use of this forthcoming deluge of information. However, because a genome sequence should not be considered an end-point and is only the first step in understanding biological processes, the microbial scientific community at large also needs to be trained and ready to make better use of this incredible resource.
