1. Biological problem and importance for practice and science
Complex processes in cells of living organisms depend on synchronous actions of different groups of genes. Coordination of gene expression is achieved to a large extent by different transcriptional control mechanisms characteristic for each gene and controlling timing, rate, and level of its transcription. Promoters represent genomic regions containing many such regulatory signals. Polymerase II promoters are located at the beginning of the genomic region whose transcription they control. The boundaries of promoters are not very clear, but most important transcriptional signals known today are generally located within the segment of [-2000, +500] relative to the transcription start site (TSS), which often already includes proximal enhancers. The +1 position corresponds to the position of the first transcribed nucleotide.
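The [-2000, +500] window can be turned into genomic coordinates with a few lines of code. The following is a minimal sketch, not part of any of the programs discussed here. The function name `promoter_window`, the 1-based inclusive coordinates, and the convention that there is no position 0 (so -1 is the nucleotide immediately upstream of +1) are assumptions made for illustration.

```python
def promoter_window(tss, strand, upstream=2000, downstream=500):
    """Return the 1-based inclusive (start, end) genomic interval that
    covers positions [-upstream, +downstream] relative to the TSS.

    tss is the genomic coordinate of the +1 nucleotide.
    Assumption: there is no position 0, so -1 is directly upstream of +1.
    """
    if strand == "+":
        # Upstream means smaller coordinates on the forward strand.
        start = tss - upstream
        end = tss + downstream - 1
    elif strand == "-":
        # On the reverse strand the transcript runs toward smaller coordinates.
        start = tss - downstream + 1
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the start of the sequence; clipping at the end would need
    # the chromosome length, which this sketch does not know.
    return max(1, start), end


if __name__ == "__main__":
    print(promoter_window(10000, "+"))
    print(promoter_window(10000, "-"))
```

Both calls cover 2500 nucleotides. Only the orientation of the window around the TSS differs between strands.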
Accurate promoter determination can help to (1) identify new genes; (2) complement annotation of known genes and identify their 5′ boundary; (3) predict alternative transcripts initiating from additional promoters; (4) localize the most important control region for gene activation; (5) annotate transcriptional regulatory patterns; and (6) determine organizational promoter models of different gene groups, which ultimately can be used to determine the cause-consequence relationships characteristic of different gene activation pathways and gene regulatory networks. The bottom line is that determining the location of promoters can ultimately contribute toward the understanding of the molecular control of activity of the respective genes.
2. Current possibilities for locating promoters
There are two basic categories of techniques allowing for large-scale identification of promoters: experimental and computational methods. The most promising experimental techniques are based on oligo-capping (Maruyama and Sugano, 1994), where the cap structure of mRNAs is selectively replaced by an oligonucleotide, which subsequently serves as sequencing primer to determine the 5′ sequence of the transcript. More than 83 000 human and murine transcripts have been characterized that way to date. Other experimental techniques include generation of EST sequences and mRNA fragments, and require back-mapping to original DNA genomic sequences. Unfortunately, none of these methods is perfect and even the best (oligo-capping) produces about 20 to 30% false TSS data.
Computational detection of promoters has also advanced considerably and is reaching the level where, in large-scale genomic searches, correct promoter predictions outnumber false-positive predictions severalfold or more (Scherf et al., 2000; Bajic and Seah, 2003).
3. Current computational solutions
We will make comments about promoter prediction programs (PPPs) for eukaryotic promoters developed or significantly modified in 1999 and after. Two reviews (Fickett and Hatzigeorgiou, 1997; Prestridge, 2000) discussed earlier PPP solutions. We will comment on CONPRO (Liu and States, 2002), CpGProD (Ponger and Mouchiroud, 2002), CpG-Promoter (Ioshikhes and Zhang, 2000), Dragon Promoter Finder (DPF) (Bajic et al., 2003), Dragon Gene Start Finder (DGSF) (Bajic and Seah, 2003), Eponine (Down and Hubbard, 2002), First Exon Finder (FirstEF) (Davuluri et al., 2001), McPromoter (Ohler et al., 2001), NNPP2.1 (Reese, 2001), Promoter2.0 (Knudsen, 1999), PromoterInspector (Scherf et al., 2000), PromH (Solovyev and Shahmuradov, 2003), and the system of Hannenhalli and Levy (SHL) (Hannenhalli and Levy, 2001).
4. Biological signals
There are several types of biological signals utilized to enhance computational promoter predictions.
The most significant are CpG-islands (Bird et al., 1986), relatively short stretches of DNA characterized by a high density of CpG dinucleotides in which the C nucleotides are largely unmethylated. Unmethylated C nucleotides facilitate factor binding to genomic DNA during transcriptional activation and initiation. CpG-islands are frequently found close to gene starts and thus close to promoters. CpG-islands are characteristic of a large proportion of vertebrate genomes (Cross and Bird, 1995). PPPs that use the concept of CpG-islands are CpGProD, CpG-Promoter, DGSF, FirstEF, and SHL.
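Two simple sequence statistics are commonly used to characterize CpG-rich regions: the GC fraction and the ratio of observed to expected CpG dinucleotides. The sketch below computes both for a single window. The function name `cpg_stats` is hypothetical, and the code is not the procedure used by any of the PPPs named above. Which thresholds define an island is left to the caller, since the text does not specify any.

```python
def cpg_stats(seq):
    """Return (gc_fraction, cpg_observed_over_expected) for a DNA string.

    Expected CpG count is taken as (#C * #G) / length, so the ratio is
    (#CpG * length) / (#C * #G). Returns 0.0 for the ratio when the
    window contains no C or no G.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


if __name__ == "__main__":
    for window in ("CGCGCG", "ACGT", "ATATAT"):
        print(window, cpg_stats(window))
```

In practice such statistics would be computed over sliding windows of a few hundred nucleotides rather than over toy strings like these.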
Another significant characteristic, at least of the human genome, is an elevated GC content around the TSS, across the first exon, and around the first splice donor site as opposed to other parts of the genome (Louie et al., 2003). These properties are only partially related to CpG-islands. Moreover, there is a strong bias in the nucleotide content in 5′ region of genes (Louie et al., 2003; Majewski and Ott, 2002), including CpG dinucleotide and GGG trinucleotide. PPPs that utilize different GC content of promoters are DPF, DGSF, Eponine, FirstEF, and together with other features also PromoterInspector.
The third characteristic of promoters, one that is more universal across different taxa, is the presence of specific combinations of promoter elements (PEs) (transcription factor binding sites and promoter boxes, and their combinatorial and positional distribution patterns) in individual promoters and in promoter groups of coregulated genes (e.g., Fessele et al., 2002). PPPs that use different combinations of promoter elements are McPromoter, NNPP2.1, and Promoter2.0.
The fourth characteristic of promoter regions versus nonpromoter regions can be found in different densities of potential PEs in these regions. Independent studies of authors suggest a number of overrepresented PE patterns, such as E2F, ETF, Elk-1, and ZF5, in human promoters as opposed to nonpromoters. However, as such overrepresentation is method/data dependent in most cases, this may impact performance of PPPs that utilize this promoter characteristic.
In addition, there are some other physicochemical properties that distinguish promoters from nonpromoters, such as DNA bendability, propeller twist, and so on (Ohler et al., 2001). A PPP that uses these properties in combination with other features is McPromoter.
PPPs employ numerous types of biological information, and currently, the most efficient systems utilize some of the key biological characteristics of promoter regions or their combinations.
5. Implemented technology
PPP implementations include position weight matrices and their higher-order derivatives (in DPF, DGSF, FirstEF, Eponine); various artificial intelligence approaches such as artificial neural networks (DPF, DGSF, McPromoter, NNPP2.1, Promoter2.0, PromoterInspector), Interpolated Markov Models (McPromoter), and relevance vector machines (Eponine); some standard statistical techniques such as linear (SHL, PromH) and quadratic discriminant analysis (FirstEF, CpG-Promoter); as well as comparison with orthologous promoter sequences (PromH). Some solutions use combinations of different separate solutions (CONPRO).
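As an illustration of the simplest of these techniques, the sketch below builds a log-odds position weight matrix from a handful of aligned sites and scans a sequence for the best-scoring window. It is a generic textbook construction under stated assumptions (uniform background, a pseudocount of 1, upper-case A/C/G/T only) and does not reproduce the matrices or scoring of any program listed above. The function names and the toy TATA-like sites are made up for the example.

```python
import math


def build_log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sites.

    Assumptions: uniform background frequency and a fixed pseudocount.
    """
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_hit(pwm, seq):
    """Return (start, score) of the highest-scoring window, or None if
    the sequence is shorter than the matrix. Expects upper-case ACGT."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    # Toy training sites, invented for illustration only.
    sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
    pwm = build_log_odds_pwm(sites)
    print(best_hit(pwm, "GGCGTATAAAGGC"))
```

Higher-order derivatives of this idea replace the independent per-position columns with scores that depend on neighboring nucleotides.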
6. Information provided to end users
The existing PPPs provide different information to the end user. Prediction of promoters can be strand-specific (DPF, DGSF, Eponine, FirstEF, CONPRO, McPromoter, NNPP2.1, Promoter2.0, PromH) or non-strand-specific (CpG-Promoter, CpGProD, SHL, PromoterInspector). In the first case, the direction of the body of the transcript is predicted, while in the latter case only the genomic location of its 5′ end is predicted, with no direction for the transcript. Some programs attempt to predict the actual TSS locations (DPF, DGSF, Eponine, FirstEF, CONPRO, McPromoter, NNPP2.1, Promoter2.0, PromH), while others predict a region expected to contain a TSS (CpG-Promoter, CpGProD, DGSF, SHL, PromoterInspector). In addition to these basic data, some PPPs also provide additional information such as potential transcription factor binding sites (TFBSs) in the region of interest (DPF, DGSF, PromH), information on other genomic signals, such as the first donor site (FirstEF), CpG-islands (FirstEF, CpGProD, CpG-Promoter), the genomic location corresponding to the translation initiation codon (DPF), and so on.
7. Current performance
We tested several PPPs on three whole human chromosomes (4, 21, and 22). Details of the experiment were as presented in Bajic and Seah (2003), with the only difference being that we merged predictions of individual systems if they were no more than 1000 nucleotides apart (except for PromoterInspector). Performance, in percent, is given for each PPP in the form (sensitivity, positive predictive value): DPF (for threshold 0.5) (80.70, 22.21), DGSF (65.04, 77.27), Eponine (41.04, 85.20), FirstEF (78.43, 48.08), McPromoter (for threshold -0.005) (55.65, 70.95), NNPP2.1 (for threshold 0.99) (60.52, 7.40), and Promoter2.0 (60.52, 4.61). PromoterInspector achieved (42, 52) and (11, 96) on the complete human and mouse genomes, respectively, measured relative to the known transcripts (not gene numbers) and including a large number of alternative transcripts usually not considered.
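The two measures and the merging step can be sketched as follows. This is an illustrative approximation, not the evaluation protocol of Bajic and Seah (2003), which is not reproduced in this text. The sketch assumes that merged predictions are represented by the integer mean of their positions, that a prediction counts as correct if it lies within `max_dist` nucleotides of any known TSS, and that strand is ignored. All names and the default of 500 are hypothetical.

```python
def merge_predictions(positions, max_gap=1000):
    """Merge sorted predictions that are no more than max_gap apart.

    Assumption: each merged cluster is represented by the integer mean
    of its member positions.
    """
    merged = []
    cluster = []
    for pos in sorted(positions):
        if cluster and pos - cluster[-1] > max_gap:
            merged.append(sum(cluster) // len(cluster))
            cluster = []
        cluster.append(pos)
    if cluster:
        merged.append(sum(cluster) // len(cluster))
    return merged


def sensitivity_ppv(predictions, true_tss, max_dist=500):
    """Return (sensitivity, positive predictive value) as fractions.

    Sensitivity: share of known TSSs with a prediction within max_dist.
    PPV: share of predictions within max_dist of a known TSS.
    """
    def near(x, others):
        return any(abs(x - o) <= max_dist for o in others)

    found = sum(1 for t in true_tss if near(t, predictions))
    correct = sum(1 for p in predictions if near(p, true_tss))
    se = found / len(true_tss) if true_tss else 0.0
    ppv = correct / len(predictions) if predictions else 0.0
    return se, ppv


if __name__ == "__main__":
    raw = [1000, 1400, 1900, 5000, 9000]   # toy prediction positions
    known = [1500, 5200, 20000, 30000]     # toy TSS positions
    merged = merge_predictions(raw)
    print(merged, sensitivity_ppv(merged, known))
```

Merging matters for the positive predictive value: without it, several nearby hits around one TSS would each be counted as a separate prediction.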
8. Future solutions: the next generation
Since the large-scale localization of TSSs in mammalian genomes has reached maturity, with ~80% of correct promoter predictions falling into the region [-500, +500] relative to the real TSS (Bajic and Seah, 2003), there are two immediate targets for future development of promoter prediction. One goal is more position-accurate TSS prediction, ideally in the range of [-20, +20] relative to the real TSSs, together with the ability to accurately distinguish alternative TSSs. The second goal, naturally, is to increase the sensitivity of predictions while preserving the specificity reached so far. This will be very important in light of the large number of alternative transcripts, which represent an important body of biologically relevant information.
Another issue will be the extension of high-accuracy predictors to nonmammalian species, including other vertebrates, invertebrates, and plants. These extensions may not necessarily be simple, as, for example, in Fugu rubripes the GC properties of promoters and nonpromoters are very different from those in the human genome. Many of the nonmammalian species do not have a high proportion of unmethylated CpG-islands characterizing gene starts, while typical promoter elements such as TATA and CCAAT boxes may appear in completely different proportions than in mammalian genomes. Furthermore, the GC content of different genomes will influence our ability to effectively search for TSSs computationally.