RNA Structure Prediction (Molecular Biology)

RNA molecules play many roles in the cell, and their biological functions are definitely not restricted to the relatively simple reading and decoding of the sequence of bases along the polynucleotide backbone that is exemplified by a simple view of messenger RNA. On the contrary, many RNA molecules exert their biological activities through the chemical architectures they form, as polypeptide chains do (see Protein Structure). In aqueous solutions containing simple cations like sodium or magnesium ions, single-stranded RNA molecules, depending on their base sequence, either remain mostly unstructured (or randomly coiled) or they fold back on themselves to form complex three-dimensional (3-D) structures. The driving force for the folding is the stacking between the bases, which minimizes their exposure to water, with the specific molecular recognition between RNA segments occurring overwhelmingly by Watson-Crick base-pairing mediated by hydrogen bonds. Some RNA molecules are able to self-assemble into complex objects because they also contain additional tertiary contacts between segments of the polynucleotide chain (eg, the self-splicing group I introns). RNA molecules can also be observed in states containing only the secondary-structure double helices, without tertiary interactions, and they often need protein cofactors for folding into their biologically active conformations in vivo.

1. Partitioning Between Secondary and Tertiary Structures

The term "secondary structure" includes all segments that can build double-stranded helices by any combination of the isosteric Watson-Crick pairings, but this has some ambiguity. A secondary structure can be broken down into recurrent separable elementary motifs, such as the helical regions (stem structures, pseudoknots) and linking nonhelical elements (hairpins and internal loops, bulges and multiple junctions) (Fig. 1). In secondary structure, a pseudoknot is a specific RNA motif that results from standard Watson-Crick pairing between a single-stranded stretch that is located between two paired strands and a second, distal single-stranded region. The single-stranded regions may belong to a hairpin loop, an internal loop, or a 3′ (or 5′) dangling end, but at least one of them must occur between strands that form a double helix. When both single-stranded regions belong to a hairpin loop, they are said to form a loop-loop motif, which is formally equivalent to a pseudoknot. On the other hand, the two-dimensional (2-D) structure reduces the secondary structure to the set of Watson-Crick base pairs that form a planar graph (ie, without crossing edges) when the sequence of bases is arranged along a circle and the base pairs are connected by edges. Because a pseudoknot introduces crossing edges, it is excluded from the two-dimensional structure; consequently, pseudoknots should be considered as belonging to the secondary structure, or perhaps even to the tertiary structure. This distinction is important, since the most efficient and most frequently used algorithms for predicting the two-dimensional structure do not take pseudoknots into account. At the next level of organization, that of the active tertiary structure, a three-dimensional architectural motif is a simple but recurrent arrangement containing a few secondary-structure elements that interact with a specific geometry and topology. The combination of such substructures leads to compact domains, which often fold autonomously and independently of the rest of the RNA architecture. 
A variety of observations support a view of RNA folding in which the three-dimensional architecture results from the cooperative compaction of separate, preformed, and stable substructures, which might undergo only minor and local rearrangement during the process. In summary, the secondary structure implies a local folding, whereas the tertiary structure refers to the global three-dimensional architecture of the RNA. The introduction of modular units, hierarchically organized and folded, circumvents most of the numerical nightmares inherent in the purely mathematical prediction of RNA structure.
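
The planar-graph criterion used above to separate the two-dimensional structure from pseudoknotted structures can be checked mechanically. The following is only a minimal sketch, assuming base pairs are given as pairs of sequence positions; the function name and the example pair lists are hypothetical and chosen for illustration.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross when drawn as chords on a circle.

    `pairs` is an iterable of (i, j) sequence positions. Two pairs (i, j) and
    (k, l) with i < k cross exactly when i < k < j < l.
    """
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False


# Hypothetical hairpin: three nested pairs closing a loop at positions 4-7.
nested = [(1, 10), (2, 9), (3, 8)]
# Pairing loop positions 5 and 6 with distal positions 15 and 14 adds crossings.
knotted = nested + [(5, 15), (6, 14)]

print(has_pseudoknot(nested))   # False
print(has_pseudoknot(knotted))  # True
```

The check compares every pair of base pairs, which is adequate for illustration but not tuned for long sequences.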


Figure 1. (a) Secondary structure with the sequence arranged within a circle graph. A pseudoknot element (represented ; the graph nonplanar. (b) The more conventional representation shows basic structural elements. The free energy of the s the contributions of all the elements. According to Nussinov's first proposition (8), each structural element is a base pair. which is complex but very powerful, each structural element is a k-loop ended by a closing base pair (black triangles in t interior base pairs of the loop (black squares in the figure). In Figure 1b, five different k-loops are represented and labele a 0-loop (no interior pair); two stacked base pairs, an internal loop, and a bulge form 1-loops (one interior pair); and junc pairs is shown in the example). The black circles have no particular meaning.

Energetically, the secondary structure is the main component of RNA architecture, while tertiary structure contributes only slightly to the Gibbs free energy stability of the native state. Therefore, determination of the secondary structure is an essential step in the study of the structure-function relationships of an RNA molecule. The state in which an RNA molecule exists can be monitored by UV absorption spectroscopy as a function of temperature, RNA concentration, and the concentrations of ions. A molecule with definite secondary and tertiary structures will normally display two melting peaks: a first sharp low-temperature transition corresponding to the melting of the tertiary structure, and a second broad high-temperature transition to the melting of the secondary structure (1). In a complex and compact three-dimensional structure, the sugar-phosphate chain folds several times on itself and, by necessity, negatively charged phosphate groups come into close contact. In order to relieve the resulting electrostatic repulsions, positively charged cations are necessary. Biologically, the most prevalent and efficient cations are magnesium ions and polyamines. Magnesium cations are "hard" ions that interact favorably with the "hard" negatively charged oxygen atoms of the phosphate groups. Polyamines always carry positively charged amino groups, and their flexibility and small size allow them to snuggle in helical grooves and in-between helical sugar-phosphate backbones. By monitoring the UV absorbance at low and high magnesium concentrations, one can distinguish the tertiary melting peak (it moves to still lower temperatures with decreasing concentrations of magnesium ions) from the secondary peak (it remains rather invariant). Native gel electrophoresis at various temperatures also distinguishes between folded and unfolded RNA molecules. 
Most important, electrophoretic methods are useful to ascertain the presence (and to evaluate the yields) of dimeric or higher oligomeric RNA species. The measurement of UV absorbance at various RNA concentrations allows one to calculate the concentrations of dimers and monomers and to measure the melting temperature of both dimers and monomers. Oligomerization may also depend on the concentrations of cations or polyamines.

2. Prediction of Secondary Structure

Computer programs exist that automatically produce a set of possible two-dimensional structures for a given RNA sequence. All are based on thermodynamic considerations, and the subsequent optimization of a set of criteria approximating the total free energy. In most cases, however, the selection of one secondary structure from the several that are usually produced by these programs and are indistinguishable in energy within the approximations used requires additional chemical or biological information stemming from experimental probing (biological approach) or comparisons of related sequences (phylogenetic or comparative approach). When a set of homologous sequences is available, which have common ancestry and function, one can search for a consensus core of secondary-structure elements, common to all the sequences, which should include a consensus three-dimensional architecture with the given function. In practice, one aims to organize the sequences so that the Watson-Crick paired regions align vertically, by searching for base covariations, ie, regions demonstrating compensatory base changes (eg, an A-U pair changing into a C-G pair) horizontally in all sequences. The more compensatory base change events there are in the sequences, the more firmly the secondary structure will be established. The efficiency of the comparative approach stems from the fact that molecular three-dimensional architectures evolve much more slowly than sequences. However, both the thermodynamic and phylogenetic methods are fraught with problems related to statistical relevance. With only four bases to choose among, purely coincidental compensatory base changes (or covariations between positions) are bound to occur. In the thermodynamic approach, they are mathematically resolved on the basis of the given set of thermodynamic parameters. Phylogenetically, the level of ambiguity can be reduced with additional sequences presenting additional covariations. 
For the establishment of a two-dimensional structure, and especially of a secondary structure, one should ideally employ all three of the approaches described above, with the caveat that the experimental data be gathered under conditions that favor a stable and functional structure. In practice, a comparison of foldings based solely on thermodynamics with the structures of 16S and 16S-like ribosomal RNA derived by the comparative approach shows that the quality of the thermodynamic predictions is variable, with percentages of correctly predicted base pairs ranging between 10 and 90% (2, 3).
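
The search for compensatory base changes described above can be illustrated on a toy alignment. This is a hedged sketch, not the procedure of any particular program: the function name, the scoring, and the four aligned sequences are invented for illustration, and G-U wobble pairs are counted as pairable alongside Watson-Crick pairs.

```python
from itertools import combinations

# Watson-Crick pairs plus the G-U wobble pair, in both orientations.
PAIRABLE = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covariation_support(alignment, col_i, col_j):
    """Summarize the evidence that two alignment columns form a base pair.

    Returns (n_pairable, n_compensatory, pair_types):
      n_pairable      number of sequences whose two bases can pair,
      n_compensatory  number of pair-type combinations differing at BOTH
                      positions (eg, A-U versus C-G),
      pair_types      sorted list of the distinct pairable combinations seen.
    """
    observed = [(seq[col_i], seq[col_j]) for seq in alignment]
    pair_types = sorted({p for p in observed if p in PAIRABLE})
    n_pairable = sum(p in PAIRABLE for p in observed)
    n_compensatory = sum(
        1 for p, q in combinations(pair_types, 2) if p[0] != q[0] and p[1] != q[1]
    )
    return n_pairable, n_compensatory, pair_types


# Hypothetical alignment; columns are 0-based.
alignment = [
    "GACUUCGGUC",
    "GCCUUCGGGC",
    "GGCUUCGGCC",
    "GUCUUCGGAC",
]

print(covariation_support(alignment, 1, 8)[:2])  # (4, 6): pairing kept, bases vary
print(covariation_support(alignment, 0, 9)[:2])  # (4, 0): conserved pair, no covariation
```

A conserved pair (second call) is compatible with a helix but provides no covariation evidence, whereas the first call shows the kind of compensatory changes that support a proposed pairing. As noted in the text, with only four bases some coincidental covariation is expected, so such counts should be read with that caveat.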

2.1. The Thermodynamic Approach

The thermodynamic approach is based on the empirical observation that the two-dimensional structure of an RNA molecule can be decomposed into elementary motifs that are identifiable and recurrent. It originates in the work of Tinoco and co-workers (4), who first generated from UV absorbance melting studies a set of thermodynamic parameters for the stability of structures formed by short oligonucleotides. Structural stability is measured by the decrease in free energy accompanying the transition from the unfolded or denatured state to the native state. The total decrease in free energy is equal to the sum of the independent contributions from each elementary motif present in the structure (the Tinoco-Uhlenbeck approximation), if we assume that tertiary interactions are weaker than secondary interactions and the sum of free energies of secondary elements is a reasonable approximation of the total free energy. The stability of a given base pair depends only on its immediate neighbors (nearest-neighbor approximation), considering that the stability is essentially related to base stacking and hydrogen bonding. The base pairs considered are the standard Watson-Crick pairs, G-C and A-U, as well as the often functionally important wobble pair G-U (5, 6).
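
The additivity and nearest-neighbor assumptions can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the numerical values are placeholders in arbitrary units, not measured thermodynamic parameters.

```python
def helix_free_energy(pairs, stack_energy):
    """Sum nearest-neighbor stacking terms over consecutive base pairs of a helix.

    `pairs` lists the base pairs in order along one strand, eg ["GC", "AU"].
    `stack_energy` maps (pair, next_pair) to a free-energy increment.
    """
    return sum(stack_energy[(a, b)] for a, b in zip(pairs, pairs[1:]))


# PLACEHOLDER values in arbitrary units, chosen only to keep the arithmetic
# easy to follow. They are NOT experimental parameters.
TOY_STACK = {
    ("GC", "GC"): -3.0,
    ("GC", "AU"): -2.0,
    ("AU", "GC"): -2.0,
    ("AU", "AU"): -1.0,
}
TOY_HAIRPIN_LOOP = 4.0  # placeholder destabilizing contribution of a loop

helix = ["GC", "GC", "AU", "AU"]
stem = helix_free_energy(helix, TOY_STACK)
print(stem)                     # -6.0  (three stacking terms: -3.0, -2.0, -1.0)

# Additivity: the total is the sum of independent motif contributions.
print(stem + TOY_HAIRPIN_LOOP)  # -2.0
```

Each term depends only on a base pair and its immediate neighbor, so a helix of n base pairs contributes n - 1 stacking terms under this approximation.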

When a single sequence is available, computational methods rely on the assumption that the native secondary structure is based on the two-dimensional structure with the lowest energy, or at least belongs to one of the suboptimal predicted two-dimensional structures (7). Stereochemistry requires that at least three nucleotides separate two paired strands. As mentioned above, an additional constraint is the lack of prediction of pseudoknots. Thus, because efficient algorithms are usually based on the decomposition into substructures (which is mathematically not possible in the presence of pseudoknots), one solution is to account for pseudoknots in a second step, either visually or by algorithmic fudging.

Efficient computer programs use O(N³) recursive algorithms based on dynamic programming principles to fold sequences containing up to one thousand bases. A dynamic programming algorithm solves the problem by combining solutions corresponding to subproblems. It solves every subproblem just once and then saves the answers in a table, thereby avoiding the work of recomputing the answer every time the subproblem is encountered. The dynamic programming approach to RNA folding was first proposed by Nussinov and Jacobson (8) on the basis of the decomposition of a structure into base-paired structural elements (Fig. 2). An elegant decomposition into elementary structures proposed later by Zuker (Fig. 1) has made it possible to consider more complete thermodynamic data, at the price of increased computational time and storage requirements, depending on the way in which energy is assigned to loops (7). Both the free energies of the base pairs and additional experimental data are encoded in the energy function, resulting in the fast computation of the optimal two-dimensional structure.
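
The simplest form of this dynamic programming idea, maximizing the number of base pairs rather than minimizing a free energy, can be sketched as follows. This is a minimal illustration in the spirit of the base-pair maximization recurrence, not the code of any published program; the function names, the table layout, and the example sequence are assumptions made for the sketch. It allows Watson-Crick and G-U pairs, requires at least three unpaired nucleotides in a hairpin loop, and cannot produce pseudoknots.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    return (a, b) in PAIRS


def nussinov(seq, min_loop=3):
    """Maximize the number of nested base pairs; return (count, dot-bracket)."""
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                # case 1: j stays unpaired
            for k in range(i, j - min_loop):      # case 2: j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one optimal structure from the filled table.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                              # too short to hold a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if can_pair(seq[k], seq[j]):
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)


print(nussinov("GGGAAAUCC"))  # (3, '(((...)))')
```

Each table entry is computed once and reused, and the three nested loops over i, j, and k give a running time that grows with the cube of the sequence length, with storage growing with its square. Energy-based programs replace the pair count with loop-dependent free-energy terms but keep the same table-filling principle.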

Figure 2. Illustration of the principles underlying the dynamic programming approach. (a) The simplest recurrence proposed by Nussinov (8) for maximizing the number of base pairs. The same principle is applied by programs developed by Zuker (7), which include a more complex recurrence including loop consideration. (b) A schematic view of the principle of recurrence. For each increasing i, j subsection, the variable k is allowed to assume each position from i to j-1 to test the ability of base k to pair with base j. (c) At each point, the total number of base pairs in the section i, j is computed. For each subsection, the maximum number of pairs that can be formed is saved in M(i, j), and the value of k that yields this number is saved in M(j, i). If j cannot pair with any k in the subsection, M(i, j) = M(i, j-1). The maximum number of base pairs that can be formed in the folding of the example is given by reading M(14, 1). (d) The secondary structure is obtained by reading in the upper half matrix partners that give the maximum number of pairs in the considered subsection.

Nevertheless, the predictions depend strongly on the thermodynamic parameters used, which argues either for a lack of reliability of the optimization approaches or for incompleteness and imprecision in the energetic parameters. Although a better knowledge of the underlying thermodynamic parameters would certainly lead to better predictions, it is worth remembering the starting assumptions and limitations of the mathematical modeling. First, the Watson-Crick and wobble base pairs are not the only base-base interactions occurring between two RNA strands: various A-A or A-G pairs, Hoogsteen base pairs, or pyrimidine-pyrimidine pairings are frequently present in internal loops [eg, loop E of eukaryotic 5S rRNA (9) or the SECIS element in selenocysteine-coding mRNA (10) (see Figs. 3 and 4)] or hairpin loops [eg, the thymine loop of transfer RNA (11) or the GNRA tetraloops (12)]. The three-way junction of the hammerhead ribozyme is likewise organized around a core of tandem sheared A-G pairs and a non-Watson-Crick A-U pair (13). With pseudoknots, one of the two constituting helices will, in a favorable situation, be predicted on the basis of its stability, and visual inspection of the remaining single-stranded regions normally points to the second helix. However, regions that should form noncanonical base pairs are not considered as such by the prediction programs, but as potential Watson-Crick base-pairing regions. Thus, depending on the relative weights of the various helices, such noncanonical regions will be involved in standard helices (or in "internal loops" penalized by positive free energy increments), leading to incorrect local (and sometimes global) predictions. The manual or automated analysis of energy dot plots (14-16) offers an alternative way to test new folding models and to guide the folding of the RNA sequence according to available information. 
In short, these prediction programs should be used primarily to obtain a solid framework to which further refinements can be applied.

Figure 3. Sequence alignments of mRNAs coding for various selenocysteine-containing proteins (Gpx = glutathione per The paired segments, denoted Helix I/I and Helix II/II corresponding to the 5′ and 3′ strands, are underlined. The invari loops are highlighted (black on grey). At the base of helix II, the quartet made of four non-Watson-Crick pairs that const Insertion Sequence (SECIS element) is boxed and highlighted (white on black). Gaps have been introduced to maximize structure elements and sequence similarities.

Figure 4. Schematic view of the secondary structure of the SECIS element corresponding to the sequence alignment of I tandem sheared G-A pairs [involving hydrogen bonds between N2(G) and N7(A), as well as between N3(G) and N6(A)] between pyrimidines.

2.2. The Comparative Approach

The comparative approach is based on the assumption that the function and folding architecture have and, consequently, that a consensus secondary structure should be derivable by comparing RNA sequ of choice when a set of homologous sequences is available for RNA with the same biological functic arranged in groups and subgroups (ideally of similar size), either according to the phylogenetic class parsing. The overall robustness of the approach increases with the diversity of the sequences and the them, whereas the accuracy of each prediction depends on the number of covariation events in each g in an alignment consists of establishing the paired regions along each sequence, which should be arr; lengths of the paired regions juxtapose vertically (see the example of Fig. 3). In a second step, the co bases can be highlighted by a vertical alignment with the inclusion of blanks or gaps in a fashion sim sequences (see Figs. 3 and 4).

Finding the conserved core of the secondary structure is a difficult task that brings together, within a of aligning sequences and assessing pairs. The first task is usually done by hand, with the help of coi similarity search (17, 18). The second task consists of searching for nucleotide interactions by measu RNA positions in the alignment (19, 20). It is most effective when an alignment is available but, at tl or readjusting the alignment. Computationally expensive approaches that attempt to find a conservec convergent cycle, combining the alignment search and secondary-structure assessment, have also bee usually aligns one sequence to a hypothetical secondary-structure model (21).

3. Prediction of Tertiary Structure

Although structure is not yet sufficient to predict function, a proper understanding of the functional m macromolecule requires knowledge of its precise molecular organization in space. In the absence of structure, molecular modeling attempts to construct and propose a three-dimensional architecture for mixture of theoretical and experimental data. Hence, prediction methods range from the most mathem solely on computer algorithms, to the most pragmatic and operational ones, in which insights come a experiment. Modeling is best considered a heuristic tool that should help in the rationalization of exp and most important, should suggest new relations between the various components of the modeled m three-dimensional model, mutagenesis of a macromolecule will, by necessity, be somewhat random, informative. At best, mutagenesis experiments performed under such conditions will confirm the sec molecule, since there is no tertiary model able to organize the data at a higher level. Such experimen bootstrapping a three-dimensional structure, which will serve as a framework for organizing existing mutagenesis experiments.

Construction of the tertiary structure of an RNA molecule always assumes and starts from a given se tertiary contacts can be gained through chemical modifications (the importance of specific atomic pc protections cannot be explained solely by the secondary structure; see Footprinting nucleic acids), b yield directly the partners if we assume a single conformer in solution), and most efficiently by caref in the case of the prediction of the secondary structure, the approaches can be divided into those rely algorithms for automatic folding prediction and those relying on previously accumulated knowledge One should weigh the advantages of mathematical objectivity and automation while, on the other han biased human decisions that include the weighing and integration of highly variable, diverse, and soi distance geometry method (22) falls in the first category. Problems do occur when applying this met] chiralities and for avoiding knots in the structures.

Another method (23) exploits a pseudoatom approach, with either one pseudoatom per helix or one p Appropriate potential functions have been developed. The use of spherical pseudoatoms leads to a lo fragments and, most important, all fine interactions that control RNA folding are ignored. A third app satisfaction algorithm, which searches conformational space so that, for a given set of input constraii pairs, distances), all possible models are produced (24). The manual approach involves the extensive crystallography and NMR structures. With developments in the production and purification of RNA sequence (either synthesized chemically or with the bacteriophage T7 RNA polymerase), new crysta published at an increasing pace. The structures can be used to extract the structure of a fragment of ii then assembled manually on a computer graphics screen, using interactive modeling procedures, and restrained least-squares minimization, molecular mechanics, or molecular dynamics programs (25). r imply some human judgments that ultimately depend on the available database, as well as the stereo< knowledge of the modular. However, the human mind can quickly grasp three-dimensional relations solutions, and take into account diverse experimental data. The solvent-accessible surface of the fina validate the structure against experimental reactivities of specific positions to chemical reagents (26)

4. Perspectives

Several biologically important RNA families have been identified and their secondary and tertiary st approaches described above. In some cases of low conservation of bases along the sequence, the kno sequence and the secondary (or even the tertiary) structure can be viewed as a biological signal to sea to increase and refine the knowledge of a given signal or to identify genomic sequences as specific E tRNA is certainly the best example, since it is now possible to scan new genomic databases with prec tRNA sequences (27). In the future, it will be necessary to scan genomic sequences rapidly to identif RNA. Some programs already offer a dedicated language to specify and search for such sequence-sti increasing pace at which three-dimensional RNA structures are produced, such developments should secondary and tertiary structures of RNA sequences identified as functionally important in genomes.
