Modeling protein evolution (Genetics)

Proteins are the biological macromolecular entities most closely and directly related to organismal function, and knowledge of protein structure and function is critical to understanding biological organization. The evolution of protein sequence is molded by the specific requirements of structure and function, and thus an important avenue for predicting features of protein structure and function is to model the evolutionary process in proteins. Despite this, models of protein evolution have lagged far behind models of DNA evolution. An obvious and important reason for this is that proteins are composed of 20 amino acids, while DNA is composed of only four nucleotides; it is both harder to make calculations with amino acids and harder to find data sets sufficiently large to accurately estimate substitution probabilities between so many states. These problems are compounded when the evolutionary processes occurring at the DNA and protein levels are combined to model the evolution of amino acid coding units (codons, which have 64 states). A somewhat more subtle point is that DNA models have progressed partly because they usually assume, implicitly or explicitly, that nucleotide substitutions during the course of evolution occur in a random, or neutral, fashion, with all positions at all times evolving independently in the same manner and at the same rate. In contrast, experimental evidence (e.g., mutagenesis and functional analysis) has long shown that different positions in proteins can have wildly different tolerances for different amino acids and that nonadditive or interactive functional relationships among positions abound.
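To give a rough sense of how quickly the estimation problem grows with the number of states, the sketch below counts the free parameters of a general time-reversible substitution model. The choice of that model is an assumption made here for illustration (the text above does not name one), and the function name is hypothetical.

```python
def reversible_model_free_parameters(n_states: int) -> int:
    """Free parameters of a general time-reversible model (illustrative assumption).

    The model has one symmetric exchangeability per unordered pair of states,
    minus one because the overall rate scale is arbitrary, plus n - 1 free
    equilibrium frequencies (they must sum to one).
    """
    exchangeabilities = n_states * (n_states - 1) // 2
    frequencies = n_states - 1
    return (exchangeabilities - 1) + frequencies


# State counts taken from the text: 4 nucleotides, 20 amino acids, 64 codons.
for name, n in [("nucleotides", 4), ("amino acids", 20), ("codons", 64)]:
    print(f"{name}: {n} states, {reversible_model_free_parameters(n)} free parameters")
```

Under this assumption the counts are 8, 208, and 2078 respectively, which illustrates why much larger data sets are needed to estimate amino acid or codon models than nucleotide models.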

Although it causes computational difficulties, the complexity of protein evolution also provides useful opportunities. Because proteins are under higher degrees of selective pressure, they often change more slowly, and their evolution can allow us to look back further in evolutionary time. The dependence of the evolutionary process on a protein’s structure and function means that we can use the evolutionary data to gain insight and understanding of these fundamental characteristics. Studying changes in the evolutionary process of individual proteins can tell us about how protein structure and function change over time. Finally, the sequence databases represent the record of eons of natural mutagenesis and selection experiments, allowing us to probe the relationship between protein sequence and resultant protein properties.

A number of recent developments have allowed the modeling of protein evolution to become a more accurate, diversified, and broadly useful pursuit. An increase in computational power is perhaps foremost among these developments, but not simply because the same old calculations can be performed faster. In addition to increasing speed, computational advances have spurred and made practical the development of novel and sophisticated statistical methodologies using complex models that were unthinkable when computers were slower. These fundamentally model-based approaches allow incorporation of biochemical knowledge and testing of evolutionary hypotheses in a flexible and statistically sound manner. They also allow the incorporation of hypotheses concerning the phylogenetic relationships among genes or species, a component that is essential for reducing noise and spurious correlations in evolutionary analyses. The simple but obviously incorrect assumption of treating sequences as though they are independent (unrelated) entities is no longer necessary or advisable.

Another important development, the same one that spurred the creation of the genomics and bioinformatics fields, was the advent of rapid, cheap, and large-scale DNA sequencing capabilities. Along with the sequencing efforts focused on obtaining large quantities of sequence from one or a few species (which, except for large multigene families, are relatively useless for studying evolutionary processes), there has also been considerable production of homologous sequences from divergent organisms. This sampling of many taxa, or “genomic biodiversity”, is essential to the development of sophisticated and realistic models of protein evolution, since discerning the evolutionary behavior at individual sites requires enough biodiversity data that many substitutions will have occurred at each individual site over the evolutionary time during which the sequences diverged.

Finally, the newest and potentially most revolutionary developments have been in the fields of protein structure prediction, experimental determination, and design. Despite a tradition of unbridled optimism, protein folding and structure prediction has long been problematical and even less tractable than studying protein evolutionary dynamics. There has been more success with the inverse problem, that of finding sequences that conform to a particular fold. A dramatic recent success in the inverse folding field used a heuristically optimized combination of simple energy potentials and observed distributions of conformations for both amino acid side chains and main chain oligomers. Calculations for these methods are relatively rapid, and are in the range of being useful for evolutionary studies and predictions. In addition, observations of adjacency statistics between pairs of residues in an enlarged database of three-dimensional protein structures (from X-ray crystallography or NMR), although obviously crude, have provided useful information for modifying amino acid substitution probabilities, and have provided semirealistic complex models for simulating long-term evolutionary processes and discovering consequences of these processes that may be reflected in real proteins.

How then have models of protein evolution been developed to incorporate biological realism? For a long time, substitution models were never inferred from an individual data set of interest, but were instead obtained from observations of differences between many closely related protein pairs or sets of sequences. Since these observations were averaged over protein positions, evolutionary time, and many unrelated proteins, they necessarily assumed an unwarranted consistency.
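The pooled counting approach described above can be sketched in a few lines. This is a simplified illustration rather than the procedure used for any published matrix: the function name, the gap symbol, and the toy sequences in the usage note are all hypothetical, and real matrices involve further corrections (for example, for divergence level and background frequencies).

```python
from collections import Counter


def pooled_substitution_frequencies(aligned_pairs):
    """Estimate P(y | x) by pooling aligned residue pairs over all sites and proteins.

    aligned_pairs: iterable of (seq_a, seq_b) strings of equal length, with '-'
    marking gaps. Counts are symmetrized because the direction of change
    between two present-day sequences is unknown.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(seq_a, seq_b):
            if x == "-" or y == "-":
                continue  # skip gapped columns
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    # Row totals turn symmetrized counts into conditional frequencies.
    row_totals = Counter()
    for (x, _), c in counts.items():
        row_totals[x] += c
    return {(x, y): c / row_totals[x] for (x, y), c in counts.items()}
```

For example, with the toy input `[("AAD", "AAE"), ("DD", "DD")]`, the estimate for D being aligned with E is 0.2. Every site and every protein contributes to the same table, which is exactly the averaging over positions, time, and unrelated proteins that the paragraph above criticizes.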

Relatively early modifications included specific focus on slowly evolving sites, particular genome types (e.g., mitochondrial or viral genomes), or residues involved in particular structural features (e.g., α-helices, β-sheets, or buried residues and residues exposed to solvent), as well as collecting observations for different amounts of evolutionary divergence. Eventually, models of transition between these different contexts along the sequence (hidden Markov models) were used to allow incorporation of these specialized observations into phylogenetic likelihood analysis. Some success in predicting secondary structure was achieved, along with more probable reconstructions of protein phylogenies. Still, there was no strong reason to believe that these specialized contexts, defined on the basis of preconceptions about what might be most important in determining evolutionary processes, did not represent unknown amalgams of heterogeneous processes that were yet to be deciphered. Although incorporation of rate variation among sites, a technique used in DNA evolutionary analysis, was used to address some of this hidden variation, an important advance was the use of mixture models and substitution matrices derived as functions of the chemical properties of amino acids. These mixture models allow the associations of substitution matrices with positions in an alignment to arise freely during the course of analysis, and thus can obtain novel information and produce inferences that are not possible when the substitution classes are predefined. The evolutionary process may also change over time, but evolutionary models incorporating such change are still in their infancy, and still limited by the size of current data sets.
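The idea that class membership can "arise freely" in a mixture model can be illustrated with a deliberately small sketch. It assumes a single pair of sequences rather than a full phylogeny, a four-letter toy alphabet, and an F81-style substitution process for each class; none of these choices come from the text above, and all names and numbers are hypothetical.

```python
import math


def f81_transition_prob(x, y, freqs, t):
    """P(y after time t | x) under an F81-style model with equilibrium freqs.

    With probability exp(-t) no replacement event occurs; otherwise the new
    state is drawn from the class's equilibrium frequencies. Time t is in
    units where the event rate is 1 (not rescaled to expected substitutions).
    """
    stay = math.exp(-t)
    return stay * (1.0 if x == y else 0.0) + (1.0 - stay) * freqs[y]


def site_class_posteriors(x, y, class_freqs, class_weights, t):
    """Mixture likelihood of one aligned site (x, y) and posterior class weights."""
    joint = [
        w * freqs[x] * f81_transition_prob(x, y, freqs, t)
        for freqs, w in zip(class_freqs, class_weights)
    ]
    likelihood = sum(joint)  # the site likelihood sums over all classes
    return likelihood, [j / likelihood for j in joint]


# Two hypothetical classes: one favoring hydrophobic residues, one favoring polar ones.
hydrophobic = {"L": 0.45, "V": 0.45, "S": 0.05, "T": 0.05}
polar = {"L": 0.05, "V": 0.05, "S": 0.45, "T": 0.45}
likelihood, posteriors = site_class_posteriors(
    "L", "V", [hydrophobic, polar], [0.5, 0.5], t=0.5
)
print(posteriors)
```

No site is assigned to a class in advance: each class contributes to the likelihood in proportion to its weight, and the posterior computed afterwards indicates which class best explains the observed residues. In this toy case the L/V site is attributed almost entirely to the hydrophobic class.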

One of the more important reasons for improving models of evolution in proteins is to understand the forces of selection that act on proteins, and to separate these forces from stochastic processes that affect substitution probabilities (random drift). Although it cannot be said that these two processes of change have been cleanly separated, in many cases statistical features have been evaluated that strongly indicate selection. The most clear-cut cases are those of diversifying selection, in which a sort of molecular cat-and-mouse scenario emerges that drives amino acid substitution at an accelerated rate that is greater than neutral expectations. Such scenarios may occur commonly in situations involving, for instance, a pathogen avoiding a host immune response, but still represent special cases and are not observed in most proteins. It is much more common that a burst of amino acid substitution might occur on a particular branch, possibly as a result of a change of function or specificity, and this may be detected as a brief elevated rate of amino acid versus nucleotide substitution. It is also possible to detect coevolution between residues in proteins, in which case substitutions at one position alter substitution probabilities at other positions. When genes have duplicated, it is possible to detect changes in rates or patterns of substitutions at individual sites, and thus to identify changes that may have been due to functional divergence. Finally, it is possible to use evolutionary analysis to predict ancestral sequences; these can then be resurrected and analyzed to infer patterns of functional change along the phylogeny (a process sometimes referred to as “paleobiochemistry”).
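Comparisons of amino acid versus nucleotide substitution, as mentioned above, start from classifying codon differences as synonymous (amino acid unchanged) or nonsynonymous (amino acid changed). The sketch below does only that raw classification for two aligned coding sequences, assuming the standard genetic code; it is not a rate estimator, since it does not normalize by the number of synonymous and nonsynonymous sites or correct for multiple substitutions. The function name and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Count synonymous and nonsynonymous single-nucleotide codon differences.

    Codons differing at more than one position are skipped, because the order
    of the intermediate changes (and hence their classification) is ambiguous.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(a != b for a, b in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codons, or ambiguous multi-step differences
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


print(count_codon_differences("CTTGATAAA", "CTCGAAAAA"))
```

In the example, CTT and CTC both encode leucine (a synonymous difference), while GAT and GAA encode aspartate and glutamate (a nonsynonymous difference). An excess of nonsynonymous change relative to a properly normalized neutral expectation is the kind of signal the paragraph above associates with diversifying selection.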

Despite the successes of these types of analyses, it may still be argued that we are far from a complete and realistic model of protein evolution. Many of the techniques used simply detect extreme evolutionary processes or unexplained changes in the evolutionary process, while the causal mechanism (adaptive burst, functional divergence, coevolution) is more a matter of perspective and hope than of direct evidence. Many of these mechanisms may be related and interact in ways that have yet to be deciphered. Evolutionary analyses have also not yet made dramatic progress in the prediction of protein structure and function, although preliminary results are promising. The advances in protein engineering are therefore quite exciting, in that they may provide an avenue for integrating models of evolutionary change with realistic biophysical models that predict sequence compatibility with specific structures. Simple pairwise contact probabilities have already been used to model sequence evolution in structures from the Protein Data Bank, and the full integration of structure and function prediction methods with models of protein sequence evolution may soon provide more accurate and useful tools for both purposes.
