Threading Protein Sequences (Molecular Biology)

The inverse folding problem asks which amino acid sequence, or primary structure, is most compatible with a given three-dimensional protein structure. Its formulation provided a clue to the problem of protein structure prediction, in which the three-dimensional structure is to be predicted from the primary structure alone. Given a query sequence a of a protein of unknown structure (A), the sequence is mounted onto a number of known protein structures (B, C, D, etc.) one by one, using all feasible alignments of sequence to structure. The fitness of sequence a to each structure is then evaluated to find the structure most compatible with it (Fig. 1). If the structure so found is B, it is inferred that the structure A to be predicted is probably similar to B. This is called the "3D-1D compatibility method," "fold recognition," or "prediction by threading."

Threading is an extension of the conventional comparison of sequences alone (1D-1D), but in principle it is more effective in detecting similarities between proteins because 3-D structure is more conserved during evolution than sequence (see Homological Modeling). It is observed empirically that homologous natural proteins, which arose by divergent evolution from a common ancestor and belong to the same gene family, share a common polypeptide fold. Moreover, the number of folds found in natural proteins is believed to be fairly limited (1). With the rapid growth of the protein structural database, it was logical for the threading approach to emerge. The idea that a known structure can serve as a model for a new protein arose from the inverse folding problem; the difference is that threading considers one (query) sequence in relation to all possible model structures, whereas inverse folding considers one model structure in relation to all possible sequences.


Figure 1. Protein structure prediction by threading. A query sequence is threaded through each of the template structures taken from the structural database, and the sequence/structure fitness is evaluated quantitatively with compatibility functions to find the structure most compatible with the sequence.

In threading a given 1-D sequence through a given 3-D structure, there are a large number of possible alignments; choosing among them is known as the alignment problem. Bowie (2, 3) was the first to solve this problem by applying the dynamic programming technique that had been fully developed for aligning sequences in so-called homology searches (see Sequence Analysis). To reduce the magnitude of the computational problem, Bowie et al. (3) introduced the 3-D profile table, which is constructed from the model structure (Fig. 2). This is a 20 × n table, in which scores for the 20 amino acids are arrayed along the n residue sites of the structure. Each number in the table gives the fitness of the respective amino acid for a given residue site, which depends on the site's secondary structure, hydrophilic/hydrophobic environment, etc. Given such a 3-D profile table, it is straightforward to compare it with any sequence by using the dynamic programming algorithm and to obtain the optimum path, including gaps (i.e., the best 3D-1D alignment), plus the alignment score. In actual predictive procedures, a structural library that has been converted to a set of 3-D profiles is scanned with a query sequence to seek the model structure and alignment that give the highest score (or lowest energy). If the score obtained is sufficiently high, a convincing prediction has been obtained. The computational time is no greater than that for the usual sequence homology search methods.
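As a minimal sketch of the profile-based procedure described above, the following Python code aligns a query sequence to a 3-D profile with a Needleman-Wunsch-style global dynamic programming recursion and a linear gap penalty, and then scans a small library for the best-scoring template. The profile values, the gap penalty of -1.0, the toy templates, and the names thread_sequence and best_template are illustrative assumptions; this is not the scoring scheme of Bowie et al.

```python
from typing import Dict, List, Tuple

# A 3-D profile: one dict per structural site, mapping amino acid -> fitness.
Profile = List[Dict[str, float]]


def thread_sequence(seq: str, profile: Profile,
                    gap: float = -1.0) -> Tuple[float, List[Tuple[int, int]]]:
    """Globally align a sequence to a profile.

    Returns the best score and the aligned (sequence_index, site_index) pairs.
    Amino acids missing from a site's dict score 0.0 (an assumption).
    """
    m, n = len(seq), len(profile)
    # score[i][j] = best score for seq[:i] against the first j sites
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = score[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)
            score[i][j] = max(match,
                              score[i - 1][j] + gap,   # residue left unaligned
                              score[i][j - 1] + gap)   # site left unoccupied

    # Trace back from the corner to recover one optimal path.
    pairs: List[Tuple[int, int]] = []
    i, j = m, n
    while i > 0 and j > 0:
        match = score[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)
        if score[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[m][n], pairs


def best_template(seq: str, library: Dict[str, Profile],
                  gap: float = -1.0) -> str:
    """Scan a library of profiles and return the name of the top scorer."""
    return max(library, key=lambda name: thread_sequence(seq, library[name], gap)[0])


# Hypothetical three-site templates, each site favoring one amino acid.
toy_library: Dict[str, Profile] = {
    "template_ALK": [
        {"A": 2.0, "L": -1.0, "K": -1.0},
        {"A": -1.0, "L": 2.0, "K": -1.0},
        {"A": -1.0, "L": -1.0, "K": 2.0},
    ],
    "template_KLA": [
        {"A": -1.0, "L": -1.0, "K": 2.0},
        {"A": -1.0, "L": 2.0, "K": -1.0},
        {"A": 2.0, "L": -1.0, "K": -1.0},
    ],
}

# The query "AK" skips the middle site: score 3.0, pairs [(0, 0), (1, 2)].
toy_score, toy_pairs = thread_sequence("AK", toy_library["template_ALK"])
```

The table has (m + 1)(n + 1) cells for a sequence of length m and a structure of n sites, which is the same cost as ordinary pairwise sequence alignment. In practice the raw score would also need a significance estimate before a prediction is accepted; that step is omitted here.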

Figure 2. Construction of the 3-D profile. (a) Each of the 20 amino acids is placed one by one at a site of a given 3-D protein structure to evaluate its compatibility with the structural environment. A, C, D, ... in the figure are the one-letter codes for the amino acids Ala, Cys, Asp, ..., respectively. The results are tabulated in a profile table (b). The example in this table was constructed from the sperm whale myoglobin structure (PDB code: 1mbd). Only the first 10 residue sites are shown (one row per site); the full table has as many rows as there are residue sites in the structure. In the profile table of this illustration, the 20 amino acids are sorted from left to right at each site according to the compatibility score. The amino acids of the native sequence, which are highlighted, sit toward the left-hand side of the table, implying that they are energetically favorable in the structure.

In addition to finding the best 3D-1D alignment, it is important to evaluate the fitness between a sequence and a structure. Bowie et al. (3) used a simple measure, called the "single-body approximation," which considers a single amino acid residue in its entire structural environment. The Sippl potential (see Inverse Folding Problem) has the more advanced form of a two-body function defined for two interacting residues. Summing the interactions between a central residue and all surrounding residues gives the total interaction energy for the central residue. The two-body function is not logically compatible with the 3-D profile, however, because the surrounding residues are not known until after the alignment is fixed. This dilemma has been solved by the "frozen approximation" (4), in which it is assumed that the native amino acids of the model structure occupy the surrounding sites. In this way, a combination of the Sippl-type potential (or others) with the 3-D profile treatment provides rapid and more reliable predictive methods (5).
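The frozen approximation can be illustrated with a short Python sketch that turns a two-body potential into a 3-D profile: each candidate amino acid is placed at a site in turn, and its interaction energy is summed against the native residues at the neighboring sites. The contact list, the pair energies, and the function name frozen_profile are hypothetical placeholders chosen for illustration; a real Sippl-type potential is distance-dependent and derived from database statistics.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frozen_profile(native: str,
                   contacts: List[Tuple[int, int]],
                   pair_energy: Dict[Tuple[str, str], float]) -> List[Dict[str, float]]:
    """Build a profile from a pair potential under the frozen approximation.

    native      -- native sequence of the template, one letter per site
    contacts    -- pairs of site indices considered to be interacting
    pair_energy -- toy two-body energies; unlisted pairs count as 0.0 (assumption)
    """
    neighbors: List[List[int]] = [[] for _ in native]
    for i, j in contacts:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def energy(a: str, b: str) -> float:
        # The potential is treated as symmetric in its two residues.
        return pair_energy.get((a, b), pair_energy.get((b, a), 0.0))

    profile: List[Dict[str, float]] = []
    for site in range(len(native)):
        row: Dict[str, float] = {}
        for aa in AMINO_ACIDS:
            # Surrounding sites keep their NATIVE residues (the frozen part).
            total = sum(energy(aa, native[j]) for j in neighbors[site])
            row[aa] = -total  # lower energy means higher fitness
        profile.append(row)
    return profile


# Hypothetical three-site template with a single contact between sites 0 and 1.
toy_pairs = {("L", "V"): -1.0, ("K", "V"): 0.5, ("L", "L"): -0.8}
toy_profile = frozen_profile("LVK", [(0, 1)], toy_pairs)
```

Because every row depends only on the template, the table can be computed once per structure and then reused for any query sequence with ordinary dynamic programming.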

Other types of methods can be envisaged that thread the entire sequence directly through a structure without using the 3-D profile, and these can be combined with any type of potential function. The problem, however, is the large number of possible alignments. Jones et al. (6) managed to solve the problem by introducing the double dynamic programming algorithm, which had been developed by Taylor (7) for pairwise comparisons of 3-D structures. Alternatively, Monte Carlo-type optimizations may be used (8). The computations for these methods of direct threading are very time-consuming.
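To make the Monte Carlo alternative concrete, the sketch below samples alignments directly with a Metropolis criterion, scoring each trial alignment with a two-body energy over the residues actually threaded onto the structure. It is a toy illustration under stated assumptions, not the method of reference (8) and not double dynamic programming: the move set (shifting one residue by one site while preserving order), the temperature, the contact list, and the names threading_energy and monte_carlo_thread are all hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple


def threading_energy(seq: str, sites: List[int],
                     contacts: List[Tuple[int, int]],
                     pair_energy: Dict[Tuple[str, str], float]) -> float:
    """Sum pair energies over contacts whose two sites are both occupied."""
    occupant = {site: seq[k] for k, site in enumerate(sites)}
    total = 0.0
    for i, j in contacts:
        if i in occupant and j in occupant:
            a, b = occupant[i], occupant[j]
            total += pair_energy.get((a, b), pair_energy.get((b, a), 0.0))
    return total


def monte_carlo_thread(seq: str, n_sites: int,
                       contacts: List[Tuple[int, int]],
                       pair_energy: Dict[Tuple[str, str], float],
                       steps: int = 2000, temperature: float = 1.0,
                       seed: int = 0) -> Tuple[float, List[int]]:
    """Return the lowest energy found and the site index of each residue."""
    if len(seq) > n_sites:
        raise ValueError("this sketch needs at least one site per residue")
    rng = random.Random(seed)
    sites = list(range(len(seq)))  # start ungapped at the first site
    energy = threading_energy(seq, sites, contacts, pair_energy)
    best_energy, best_sites = energy, sites[:]
    for _ in range(steps):
        k = rng.randrange(len(seq))
        new_site = sites[k] + rng.choice((-1, 1))
        lower = sites[k - 1] if k > 0 else -1
        upper = sites[k + 1] if k < len(seq) - 1 else n_sites
        if not (lower < new_site < upper):
            continue  # move would leave the structure or break residue order
        trial = sites[:]
        trial[k] = new_site
        trial_energy = threading_energy(seq, trial, contacts, pair_energy)
        delta = trial_energy - energy
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            sites, energy = trial, trial_energy
            if energy < best_energy:
                best_energy, best_sites = energy, sites[:]
    return best_energy, best_sites


# Hypothetical case: two leucines, four sites, one favorable contact (0, 3).
toy_contacts = [(0, 3)]
toy_potential = {("L", "L"): -1.0}
toy_energy, toy_sites = monte_carlo_thread("LL", 4, toy_contacts, toy_potential)
```

Every step re-evaluates the full energy, which hints at why direct threading costs far more than a profile scan; incremental energy updates would reduce the cost per step but not the size of the search space.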

Many investigators have contributed to the development of various threading methods. To compare their effectiveness, a worldwide prediction contest named CASP (Critical Assessment of Techniques for Protein Structure Prediction) has been held twice thus far, in 1994 and 1996 (9). A completely blind test was arranged in which all predictors made their own predictions for target proteins whose 3-D structures were not yet known but were in the process of being determined. An appointed assessor then evaluated all of the predictions by comparing them with the structures that had subsequently been determined, and presented the results at a joint meeting of all the predictors. The results indicated that several predictions, but not all, correctly inferred the folds of target proteins from their sequences alone (10). The sequence identity between the known and predicted proteins was around 20% or less, too low for any method other than threading to identify the structural relationship. The prediction contests therefore clearly demonstrated the applicability of the threading method. At the same time, however, some drawbacks were revealed. One of them is the ambiguity often seen in 3D-1D alignments obtained by direct comparisons of structures (11). There are still problems to be overcome, and the method should be refined further. In any case, the several successful examples uncovered by CASP were the first clear verification of correct prediction of 3-D protein structures after a number of unsuccessful trials over many years.
