Eukaryotic regulatory sequences (Bioinformatics)

 

1. Introduction

In metazoan organisms, cells have to express specific sets of genes to maintain cellular housekeeping as well as the cell’s specific identity, and to respond to external stimuli that induce developmental differentiation, growth, and survival. To achieve this regulatory specificity, transcription of eukaryotic genes is controlled by a complex modular machinery of interacting proteins, which is not yet completely understood. These proteins, termed transcription factors, mediate their effects by targeting specific sites on the DNA nucleotide sequence, the regulatory sequences. The location of a control element and its distance to the regulated gene differ with the type of element. Elements that constitute the core promoter and bind RNA polymerase II (Pol II) and the general transcription factors are situated in the close vicinity of the transcription start site (TSS) (Butler and Kadonaga, 2002), whereas enhancers can be as distant as 50 kb up- or downstream from the TSS.

Unlike the sites recognized by most restriction enzymes, the DNA sites to which a transcription factor binds do not have a unique sequence (Stormo, 2000). The sequence patterns are degenerate, so that a factor recognizes a family of sequences with variable affinity. In general, the sequences are short (5-15 bp) and occur frequently in the genome by chance (Bailey and Noble, 2003). This makes any in silico identification of “real” functional sites difficult and implies that additional mechanisms are involved in specific transcriptional regulation. According to the commonly accepted models, these are mainly the combination of sites into cis-element clusters (Frith et al., 2001; Bailey and Noble, 2003) and the accessibility of the clusters to transcription factors, which is regulated by DNA methylation and chromatin structure (Wagner, 1999).
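The claim that short, degenerate sites occur frequently by chance can be made concrete with a rough calculation. The following Python sketch estimates how often a pattern written in the IUPAC alphabet would match a random sequence. It assumes independent, equiprobable bases, which real genomes do not satisfy, and the function names and the example pattern are illustrative choices rather than anything taken from the cited work.

```python
# Number of concrete bases each IUPAC nucleotide symbol stands for.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus: str) -> float:
    """Probability that one random position matches the pattern.

    Assumption: bases are independent and each occurs with probability 1/4.
    """
    probability = 1.0
    for symbol in consensus.upper():
        probability *= IUPAC_DEGENERACY[symbol] / 4.0
    return probability


def expected_chance_matches(consensus: str, genome_length: int,
                            both_strands: bool = True) -> float:
    """Expected number of matches in a random sequence of the given length.

    Counting both strands simply doubles the number of start positions;
    this overcounts patterns that equal their own reverse complement.
    """
    positions = genome_length - len(consensus) + 1
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    return positions * strands * match_probability(consensus)


if __name__ == "__main__":
    # Illustrative 7-bp pattern with one ambiguous position (S = C or G).
    pattern = "TGASTCA"
    print(match_probability(pattern))
    print(expected_chance_matches(pattern, 3_000_000_000))
```

Under these simplifying assumptions, a 7-bp pattern with a single two-fold ambiguity is expected to match by chance hundreds of thousands of times in a sequence of about 3 x 10^9 bp, which is the scale of the problem described above.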

We give a short overview of the best-characterized types of regulatory sequences and modules and of the computational methods that have been applied for their genome-wide prediction. Such predictions would bring enormous benefits for uncovering the network of transcriptional regulation and for identifying gene targets for therapeutic intervention (Frith et al., 2001). Currently, promoter prediction is far more successful than enhancer prediction, which is not yet possible.

2. Regulatory sequences and clusters

The core promoter of a eukaryotic gene is defined as the minimal stretch of contiguous DNA that is recognized by the preinitiation complex (PIC) and is sufficient for transcription initiation. The PIC consists of Pol II and the general transcription factors TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH (Fickett and Hatzigeorgiou, 1997), which are multisubunit complexes themselves. The most prominent regulatory site is the TATA box, which is typically located 25-30 bp upstream of the TSS. It binds TBP (a part of TFIID), but like all other sites it is not universal and is present only in a subset of core promoters (Butler and Kadonaga, 2002). The initiator (Inr) encompasses the TSS and is recognized by several transcription factors. The downstream core promoter element (DPE) is frequent in TATA-less promoters and is located at an invariant distance from the Inr, 28-32 bp downstream of the TSS, whereas the TFIIB recognition element (BRE) is present immediately upstream of some TATA boxes (Butler and Kadonaga, 2002). CpG islands are stretches of GC-rich DNA (0.5 to 2 kb in size) that contain multiple GC-box motifs as target sites of Sp1 and related transcription factors. They are typically found in TATA- and DPE-free core promoters of housekeeping genes and initiate transcription from multiple weak start sites. Thus, core promoters are different combinations of control elements and can as such be recognized by specific enhancers (Butler and Kadonaga, 2002).
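As a rough illustration of what "GC-rich" can mean computationally, the sketch below computes the GC fraction of a sequence and the ratio of observed to expected CpG dinucleotides, two measures often used to flag candidate CpG islands. The minimum length follows the 0.5 kb lower bound given above; the GC and CpG ratio thresholds are commonly used defaults that are assumed here, not values from the text, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed CpG count divided by the count expected from base composition.

    Expected count is (#C * #G) / length. Returns 0.0 when the sequence has
    no C or no G, because the expectation is then zero.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")
    expected = c_count * g_count / len(seq)
    return observed / expected


def looks_like_cpg_island(seq: str, min_length: int = 500,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Heuristic check; min_gc and min_obs_exp are assumed thresholds."""
    return (len(seq) >= min_length
            and gc_fraction(seq) > min_gc
            and cpg_obs_exp(seq) > min_obs_exp)
```

In practice such a check would be applied in windows along a genomic sequence; it says nothing about whether the region actually functions as a promoter.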

The proximal promoter is the region ranging from -250 to +250 bp in relation to the TSS (Butler and Kadonaga, 2002). Together with the core promoter, it constitutes the promoter. It contains multiple binding sites for a subset of the roughly 2000 transcription factors estimated for humans. Some of these sites are nonfunctional on their own and require synergistic or antagonistic combination with one or more additional sites; such combinations are called composite elements (Kel-Margoulis et al., 2002). Single sites and composite elements tend to be clustered into higher-order cis-regulatory modules, which may be a key to efficient prediction (Bailey and Noble, 2003). This model is supported by the fact that transcription factors often form multimeric functional complexes.

Enhancers are long-distance (up to +/- 50 kb) regulatory elements with a length of 50-1500 bp that harbor multiple protein-binding sites targeted by a multiprotein complex, the enhanceosome. The enhanceosome can augment gene transcription via interaction with the promoter-bound multiprotein complex (Blackwood and Kadonaga, 1998). When the effect is repressive, the element is called a silencer.

To prevent enhancers from regulating unrelated genes, their influence is blocked by special regulatory sequences termed insulators. Insulators also protect a gene against inactivation by adjacent heterochromatin (Burgess-Beusse et al., 2002).

The scaffold/matrix-attached regions (S/MARs) also act as boundary elements. They are the anchor points for chromatin loops at scaffold/matrix proteins. Among other proteins, they also bind transcription factors and are involved in the regulation of gene expression (Schubeler et al., 1996).

Another distal regulatory sequence is the locus control region (LCR), which influences several genes of a locus, for example, the globin locus. To exert its effects, it recruits a holocomplex of chromatin-modifying enzymes, coactivators, and transcription factors that can make promoters accessible and competent for subsequent transcriptional activation by other control elements (Levings and Bungert, 2002; Blackwood and Kadonaga, 1998).

3. Prediction

A transcription factor tolerates significant variation in the sequences of its DNA binding sites. Different formats have been established that try to reflect this pattern of binding specificity: consensus sequences, positional weight matrices (PWMs), and profile hidden Markov models (HMMs). A consensus sequence indicates the most preferred base at each position, using the IUPAC alphabet for ambiguous bases. Its utility for site prediction is limited: on the one hand, the ambiguities in the pattern give rise to a high number of false-positive matches (Stormo, 2000); on the other, its rigidity causes a high rate of false negatives (Quandt et al., 1995). A PWM assigns a weight to each base at each position of a site and returns a site score that is the sum of these weights. Ideally, the score gives an estimate of the free energy of the protein-DNA binding (Fickett and Hatzigeorgiou, 1997). A prerequisite, and often a bottleneck, for matrix construction is the availability of an adequate number of experimentally proven binding sites for a particular factor. Collections of known binding sites can be obtained from databases such as TRANSFAC (Matys, 2003).
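The PWM idea described above, one weight per base per position with the site score being the sum of the weights, can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the weights are log-odds values against a uniform background with a pseudocount, which is one common choice assumed here, and the example sites are invented rather than taken from a database. The sketch also assumes that sequences contain only A, C, G, and T.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a list of per-position log2-odds weights from aligned sites.

    Assumptions: all sites have equal length and are already aligned; the
    background defaults to 0.25 per base; a pseudocount avoids log(0).
    """
    if background is None:
        background = {base: 0.25 for base in BASES}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        weights = {}
        for base in BASES:
            frequency = (column.count(base) + pseudocount) / total
            weights[base] = math.log2(frequency / background[base])
        pwm.append(weights)
    return pwm


def score_site(pwm, site):
    """Site score: the sum of the position-specific weights."""
    return sum(column[base] for column, base in zip(pwm, site.upper()))


def scan(pwm, sequence, threshold):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_site(pwm, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Invented example sites, for illustration only.
    example_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
    pwm = build_pwm(example_sites)
    print(scan(pwm, "GGGTATAAAGGG", threshold=5.0))
```

The choice of threshold controls the trade-off between false positives and false negatives; no value is suggested by the text, so the one used in the usage example is arbitrary.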

HMMs are generative models that describe the probability distribution over a family of sequences. They are well suited to modeling substitutions, insertions, and deletions, with the limitation that dependencies between particular positions in the sequence cannot be included (Eddy, 1998).

PWMs and profile HMMs can identify in vitro target sequences accurately, but on their own they cannot predict sites with in vivo function. Besides knowledge about chromatin structure and protein-protein interactions, the identification of regulatory clusters and the construction of predictive models can help. Several algorithms for predicting regulatory clusters have been proposed (e.g., Bailey and Noble, 2003; Frith et al., 2001; Wagner, 1999). They can be grouped into three classes, two of which are generative in that they rely on the rule set of a cluster model; these use a sliding window approach and HMMs, respectively. The third approach is discriminative and models the difference between regulatory and nonregulatory sequences (Bailey and Noble, 2003).
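To give a feel for the sliding window class of cluster predictors, the following sketch takes the start positions of predicted binding sites and reports regions in which at least a minimum number of sites fall within one window. It is a deliberately simplified count-based rule, not a reimplementation of any of the cited algorithms, and the window size, hit threshold, and function name are assumptions made for illustration.

```python
def find_clusters(hit_positions, window=200, min_hits=3):
    """Return (start, end) regions where >= min_hits sites lie within window bp.

    hit_positions: start coordinates of predicted binding sites.
    Overlapping qualifying windows are merged into one region.
    """
    positions = sorted(hit_positions)
    clusters = []
    right = 0
    for left, start in enumerate(positions):
        # Advance right so positions[left:right] all lie in [start, start + window).
        while right < len(positions) and positions[right] < start + window:
            right += 1
        if right - left >= min_hits:
            end = positions[right - 1]
            if clusters and start <= clusters[-1][1]:
                # This window overlaps the previous region: extend that region.
                previous_start, previous_end = clusters[-1]
                clusters[-1] = (previous_start, max(previous_end, end))
            else:
                clusters.append((start, end))
    return clusters


if __name__ == "__main__":
    # Invented coordinates of predicted sites along a sequence.
    hits = [100, 150, 220, 5000, 9000, 9050, 9100, 9180]
    print(find_clusters(hits))
```

An isolated hit, such as the one at position 5000 in the usage example, is discarded, which reflects the idea that single matches are more likely to be chance occurrences than clustered ones.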

Phylogenetic footprinting, the identification of conserved regions in orthologous gene sequences, may give supporting evidence for predicted transcription factor binding sites (Lenhard et al., 2003). Sets of coregulated genes identified by gene expression arrays are important sources for the systematic analysis of regulatory sequences, both by pattern matching with the approaches explained above and by pattern identification algorithms that may find new regulatory elements (see Article 28, Computational motif discovery, Volume 7).
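The core of phylogenetic footprinting, finding conserved stretches between orthologous sequences, can be illustrated with a simple windowed identity scan over a pairwise alignment. The sketch below assumes that the two sequences have already been aligned by some other tool and uses "-" for gaps; the window size, the identity threshold, and the function name are illustrative assumptions, and real footprinting methods are considerably more elaborate.

```python
def conserved_windows(aligned_a, aligned_b, window=20, min_identity=0.8):
    """Return (start_column, identity) for each conserved alignment window.

    aligned_a, aligned_b: two rows of a pairwise alignment of equal length.
    A column counts as a match only if both rows carry the same non-gap base.
    Coordinates are alignment columns, not positions in either sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    a, b = aligned_a.upper(), aligned_b.upper()
    matches = [1 if x == y and x != "-" else 0 for x, y in zip(a, b)]
    results = []
    if len(matches) < window:
        return results
    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window: add the new column, drop the one left behind.
            running += matches[start + window - 1] - matches[start - 1]
        identity = running / window
        if identity >= min_identity:
            results.append((start, identity))
    return results
```

Predicted binding sites that fall inside such conserved windows could then be given more weight than those in poorly conserved regions, in line with the supporting role described above.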
