Operon finding in bacteria (Bioinformatics)

Bacterial genes are often organized into multigene transcriptional units (TUs), a series of genes that are transcribed together into one messenger RNA (mRNA) molecule. A TU starts with a promoter, which initiates transcription, and ends with a terminator, which terminates transcription (Figure 1a). The expression of the genes in a TU is controlled by one set of regulatory gene(s), which is often located nearby. The term “operon” was originally defined to include both the TU and the associated regulatory genes; however, today, the term is often used to refer only to a set of cotranscribed genes. This is especially common in computational genomics, where it is often necessary to simplify the model of an operon in order to make computational predictions. Indeed, the model shown in Figure 1(a) is simplified in a number of other aspects, too. Additional promoters and terminators may be located between the cotranscribed genes, and genes may be cotranscribed in some conditions and transcribed separately in other conditions. These details usually cannot be captured by operon-prediction methods and, therefore, they are not discussed here. Genes that are transcribed separately from other genes can be considered as one-gene TUs.

Finding TUs computationally is a difficult task; the predictions are much less accurate than the gene predictions themselves. The apparent solution would be to find the stretches of genes located between a promoter and terminator; however, this requires accurate knowledge of promoters and terminators, which are usually at least as difficult to predict as operons. The only exception here is rho-independent terminators, which have a distinctive structure and can, for some species of bacteria, be predicted accurately.


While it is difficult to predict TU boundaries, it is possible to make a rough estimate of the number of TUs in a genome. This estimate can be made using only gene predictions, which are usually more than 95% accurate in bacteria. Here, we have to introduce an additional term, “directon” (Salgado et al., 2000), which is a set of genes located consecutively, one after another, on the same DNA strand. All genes of a TU always belong to the same directon because genes must be on the same strand in order to be transcribed together. A directon may include one or more TUs and single genes (one-gene TUs). Figure 1(a) shows a directon that consists of one TU; the flanking TUs have the opposite orientation, that is, they are located on the opposite DNA strand. Figure 1(b) shows another possible situation, where the two neighboring TUs have the same orientation and belong to the same directon. If the orientation of all TUs were completely independent of their flanking TUs, then the average directon would have two TUs (Ermolaeva et al., 2001). Therefore, the size of an average TU would be half the average directon size, which can be easily calculated from the gene predictions. (Note that single genes are counted as one-gene TUs.) Most bacterial and archaeal genomes, however, have some strand bias, with more genes located on the leading strand of replication than on the lagging strand. In this case, the average TU size t can be calculated as

$$t = \frac{d\,b}{1 + b^{2}} \qquad (1)$$

where d is the average directon size and b is the strand bias, calculated as the number of genes on the leading strand divided by the number of genes on the lagging strand. We prove this as follows. First, the probability that a randomly chosen TU would be located on the leading strand is

$$p = \frac{b}{1 + b} \qquad (2)$$

Figure 1 A scheme of a transcription unit. (a) The neighboring transcription units are located on the opposite DNA strand. (b) At least two consecutive transcription units are located on the same DNA strand, making it difficult to locate a boundary between them

The probability that a randomly chosen directon has exactly n TUs is the probability that the first n TUs are located on the leading strand and the next TU is on the lagging strand plus the probability that the first n TUs are located on the lagging strand and the next TU is on the leading strand:

$$P(n) = p^{n}(1 - p) + (1 - p)^{n}\,p \qquad (3)$$

where p is the leading-strand probability from equation (2).

The average number of TUs in a directon is the sum of all possible directon sizes multiplied by their corresponding likelihoods:

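$$\bar{n} = \sum_{n=1}^{\infty} n\,P(n) \qquad (4)$$

where P(n) is the probability from equation (3) that a directon has exactly n TUs.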

Combining equations (2), (3), and (4) and reducing the infinite summation gives an average of b + 1/b TUs per directon; dividing the average directon size d by this number yields equation (1).
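The reduction of the infinite sum can be checked numerically. The sketch below assumes that the leading-strand probability is b/(1 + b), that the average number of TUs per directon reduces to b + 1/b, and that equation (1) therefore has the form t = d·b/(1 + b²); these forms follow from the definitions in the text but are a reconstruction. The function names and the number of series terms are illustrative choices, not part of the original method.

```python
def p_leading(b: float) -> float:
    """Probability that a randomly chosen TU lies on the leading strand,
    assuming strand bias b = (genes on leading) / (genes on lagging)."""
    return b / (1.0 + b)


def mean_tus_series(b: float, terms: int = 2000) -> float:
    """Average number of TUs per directon, summed term by term.

    Each term is n * P(n), with P(n) = p^n (1 - p) + (1 - p)^n p.
    The series is truncated after `terms` terms (an arbitrary cutoff).
    """
    p = p_leading(b)
    q = 1.0 - p
    return sum(n * (p**n * q + q**n * p) for n in range(1, terms + 1))


def mean_tus_closed_form(b: float) -> float:
    """Closed form of the same sum: p/(1-p) + (1-p)/p = b + 1/b."""
    return b + 1.0 / b


def mean_tu_size(d: float, b: float) -> float:
    """Average TU size: directon size d divided by (b + 1/b)."""
    return d * b / (1.0 + b * b)
```

With no strand bias (b = 1) the closed form gives two TUs per directon, so the average TU is half the average directon, in agreement with the unbiased case described above.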

The next logical step is to find the TU boundaries, that is, to predict which pairs of neighboring genes are transcribed together and which pairs belong to different TUs. For genes located on opposite DNA strands, the answer is obvious: they belong to different TUs because they cannot be transcribed together. Consecutive genes that are located on the same DNA strand may belong to the same or different TUs.
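A minimal sketch of this first step, assuming gene predictions are available as (name, start, end, strand) records: it groups genes into directons and lists the same-strand neighbor pairs, which are the only pairs that need a TU-boundary decision. The `Gene` record, the function names, and the toy coordinates are hypothetical, and the circularity of the chromosome is ignored for simplicity.

```python
from itertools import groupby
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def directons(genes):
    """Group consecutive same-strand genes (ordered by start) into directons."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [list(run) for _, run in groupby(ordered, key=lambda g: g.strand)]


def mean_directon_size(genes):
    """Average number of genes per directon (0.0 for an empty gene list)."""
    runs = directons(genes)
    if not runs:
        return 0.0
    return sum(len(run) for run in runs) / len(runs)


def same_strand_neighbors(genes):
    """Adjacent gene pairs on the same strand: candidates for sharing a TU."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if a.strand == b.strand]


# Made-up toy annotation, for illustration only.
TOY_GENES = [
    Gene("a", 1, 100, "+"), Gene("b", 110, 200, "+"),
    Gene("c", 300, 400, "-"),
    Gene("d", 450, 600, "+"), Gene("e", 610, 700, "+"), Gene("f", 705, 800, "+"),
]
```

On the toy annotation there are three directons (of two, one, and three genes), so the average directon size is 2 and three neighbor pairs remain undecided.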

TU and operon boundaries can, in some cases, be found using the knowledge of metabolic pathways and gene functions (Zheng et al., 2002). Genes that belong to the same metabolic pathway are often regulated together, and are located in the same operon. This method, however, can only be applied to well-studied genes with known metabolic pathways. Another method that relies on experimental data is described in Sabatti et al. (2002). Genes within a TU are transcribed together and, therefore, they have similar levels of expression. Microarray experiments allow us to measure the expression of genes in different conditions and to find genes with correlated expression levels. Although correlated expression does not directly imply that genes belong to the same operon, when coupled with those genes’ adjacency information on the genome it can provide strong evidential support. The main shortfall of this method is the difficulty, using current technology, of obtaining accurate and reproducible gene expression measurements, which means that a large number of experiments are required to detect the correlation.
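The idea of combining expression correlation with adjacency can be sketched as follows. This is not the procedure of Sabatti et al. (2002); it is a simplified illustration that flags adjacent same-strand gene pairs whose expression profiles have a Pearson correlation above a cutoff. The data layout, the function names, and the cutoff of 0.8 are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Returns 0.0 if either profile is constant (correlation undefined).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpressed_neighbors(order, strands, expression, min_r=0.8):
    """Adjacent same-strand gene pairs with correlation >= min_r.

    order      -- gene names in chromosomal order
    strands    -- dict: gene name -> "+" or "-"
    expression -- dict: gene name -> list of measurements, one per condition
    min_r      -- hypothetical cutoff; it would need tuning on real data
    """
    hits = []
    for g1, g2 in zip(order, order[1:]):
        if strands[g1] != strands[g2]:
            continue  # opposite strands: cannot be cotranscribed
        r = pearson(expression[g1], expression[g2])
        if r >= min_r:
            hits.append((g1, g2, r))
    return hits
```

Because individual measurements are noisy, a sketch like this only becomes informative when each profile covers many experiments, which is the limitation noted above.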

The two methods described above require extensive experimental data to support any conclusions about TUs. In addition, there are a few “purely computational” approaches that rely only on the DNA sequence data and gene predictions. Two such methods are described below.

The first method is based on calculating intergenic distances, that is, the distances between neighboring genes (Salgado et al., 2000). As illustrated in Figure 1(b), two neighboring genes that belong to different TUs must be separated by a terminator and a promoter, so the distance between such genes is usually longer than the intergenic distances within TUs. Genes that are located close to each other are therefore likely to belong to the same TU, while genes that are separated by hundreds of nucleotides are (in prokaryotes) likely to have a TU boundary between them. This remarkably simple method has a high accuracy (compared with the other methods) of 82% for the Escherichia coli genome. It can also be used for other bacterial and archaeal genomes (Moreno-Hagelsieb and Collado-Vides, 2002), but its accuracy there is not always clear. Different genomes have different distributions of intergenic lengths, either due to fundamental differences in their biology or, possibly, to differences in methods of computational gene prediction. Figure 2 shows such distributions for three bacterial genomes: E. coli, Thermotoga maritima, and Synechocystis. In the figure, E. coli has a distinctive maximum for short intergenic distances, most of which are located within TUs. T. maritima has an even higher percentage of short distances, whereas Synechocystis appears to have few short intergenic distances.
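The distance-based idea can be illustrated with a hard threshold. The sketch below is a deliberate simplification and not the scoring scheme of Salgado et al. (2000): it places a TU boundary wherever the strand changes or the gap between same-strand neighbors exceeds `max_gap`. The `Gene` record, the coordinate convention, and the default of 50 bp are hypothetical; a suitable cutoff would have to be chosen per genome, since the distance distributions differ.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def intergenic_distance(upstream: Gene, downstream: Gene) -> int:
    """Number of bases between two adjacent genes; negative if they overlap."""
    return downstream.start - upstream.end - 1


def split_into_tus(genes, max_gap=50):
    """Partition genes into predicted TUs using strand and gap length only.

    max_gap is an illustrative cutoff in bp, not a published value.
    """
    tus = []
    for gene in sorted(genes, key=lambda g: g.start):
        if tus:
            prev = tus[-1][-1]
            close = intergenic_distance(prev, gene) <= max_gap
            if prev.strand == gene.strand and close:
                tus[-1].append(gene)  # extend the current TU
                continue
        tus.append([gene])  # strand change or long gap: start a new TU
    return tus
```

Genes that end up alone in their list correspond to the one-gene TUs mentioned earlier.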

It is not clear whether the genomes with longer average intergenic distances have fewer operons or whether the distances between genes within operons are longer. Figure 3 shows that there is only a weak correlation (Pearson correlation coefficient 0.26) between the percentage of short distances in a genome and the average number of genes in a TU (calculated using equation (1)). We should mention here that intergenic distances may vary significantly with the accuracy of placement of gene start sites (i.e., the initial ATG start codon), whose prediction varies dramatically depending on the gene prediction method used. This bias, however, does not appear to be the sole reason for the weak correlation: using subsets of genes whose start sites are known with greater accuracy (from sequence homology with genes in other species and from ribosome binding site predictions) has only a small effect on the overall picture.

Figure 2 Frequency of intergenic distances (in 10 bp bins) in three bacterial genomes. Only distances between genes located on the same DNA strand were considered

Figure 3 Each dot represents one sequenced bacterial or archaeal genome. The x axis shows the percentage of short intergenic distances (from -5 to 15 bp) in the genome (only distances between genes located on the same DNA strand were considered), and the y axis shows the average number of genes in a transcriptional unit, calculated using equation (1)
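The two summary statistics used in Figures 2 and 3 are straightforward to compute from a list of same-strand intergenic distances. The sketch below assumes the distances have already been extracted; the function names are hypothetical, and the treatment of bin edges (each bin labeled by its lower bound, with both ends of the -5 to 15 bp range included) is an assumption, since the captions do not specify it.

```python
from collections import Counter


def short_distance_fraction(distances, low=-5, high=15):
    """Fraction of intergenic distances within [low, high] bp, inclusive."""
    if not distances:
        return 0.0
    short = sum(1 for d in distances if low <= d <= high)
    return short / len(distances)


def histogram_10bp(distances, bin_width=10):
    """Count distances per bin; each bin is keyed by its lower bound in bp.

    Floor division keeps negative distances (overlapping genes) in the
    correct bin, e.g. -4 falls into the bin starting at -10.
    """
    return Counter((d // bin_width) * bin_width for d in distances)
```

Multiplying the fraction by 100 gives the percentage plotted on the x axis of Figure 3.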

The second, more computationally sophisticated operon-finding technique is based on finding conserved gene clusters (Bansal, 1999; Overbeek et al., 1999; Huynen et al., 2000; Snel et al., 2000; Ermolaeva et al., 2001; Mering et al., 2003). Bacterial genomes often reshuffle their genes, changing gene order and orientation. Even such evolutionarily related genomes as E. coli, Haemophilus influenzae, and Vibrio cholerae have completely different gene orders when viewed at a genome scale. Genes that belong to the same operon, and that are regulated by the same mechanisms, are under greater selective pressure to remain together, even as other genes are shuffled. Therefore, when conserved gene clusters are observed in evolutionarily distant genomes, one can infer that these clusters are likely to represent TUs. Figure 4 shows a scheme of a conserved gene cluster that consists of two genes.

Figure 4 Genes A and B are located nearby, within one directon in both genomes, forming a conserved gene cluster. A1 is an ortholog of A2, and B1 is an ortholog of B2

The specificity of these methods depends on the extent of gene cluster conservation, and may reach 98% or higher (Ermolaeva et al., 2001) if the gene cluster is shared by at least a few evolutionarily distant genomes. Such methods, however, can locate only a portion of all operons, because many operons are present in only one or two currently sequenced genomes. Fortunately, the predictive power of this comparative genomics strategy will improve steadily over time, as the sequences of more bacterial and archaeal genomes become available.
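The conserved-cluster idea for a two-gene cluster, as in Figure 4, can be sketched as follows. This is a simplified illustration and not the algorithm of any of the cited papers: it assumes ortholog assignments are already available, treats only strictly adjacent same-strand genes as a cluster, and ignores gene order within the pair and the evolutionary distance between genomes. The data layout and function names are hypothetical.

```python
from collections import Counter


def adjacent_group_pairs(genome):
    """Unordered ortholog-group pairs of adjacent same-strand genes.

    genome -- list of (gene_id, strand, ortholog_group) in chromosomal
              order; ortholog_group is None for genes without orthologs.
    """
    pairs = set()
    for (_, s1, g1), (_, s2, g2) in zip(genome, genome[1:]):
        if s1 == s2 and g1 is not None and g2 is not None and g1 != g2:
            pairs.add(frozenset((g1, g2)))
    return pairs


def conserved_pairs(genomes, min_genomes=2):
    """Ortholog-group pairs that are adjacent in at least min_genomes genomes."""
    counts = Counter()
    for genome in genomes:
        counts.update(adjacent_group_pairs(genome))
    return {pair for pair, n in counts.items() if n >= min_genomes}


# Toy genomes modeled on Figure 4; identifiers are made up.
GENOME_1 = [("A1", "+", "A"), ("B1", "+", "B"), ("C1", "-", "C")]
GENOME_2 = [("X2", "+", "X"), ("B2", "-", "B"), ("A2", "-", "A"), ("C2", "-", "C")]
```

On the toy genomes only the pair A and B is adjacent on one strand in both, so it is the only conserved pair reported. Raising `min_genomes` mirrors the observation above that specificity increases when a cluster is shared by more genomes, at the cost of finding fewer operons.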
