In silico approaches to functional analysis of proteins (Bioinformatics)

1. Introduction

Proteins are unrivaled as the primary functional agents of all biological systems. Hence, understanding how the sequence and structure of a protein relate to its function is tantamount to decoding the most basic aspects of any given biological system. The emergence of the genomic era has resulted in an explosive growth of available protein sequences and other forms of protein-related information. This has not only placed considerable pressure on the conventional techniques of biochemistry and molecular biology to scale up their range of operations but has also created a strong demand for theoretical prediction of biological and biochemical functions from the available data. In particular, the immense preponderance of protein sequence data over all other forms of information pertaining to proteins places a considerable premium on using sequence information as the central platform for exploring the core aspects of biology. Accordingly, reconstructing protein function requires an understanding of the dependencies between the sequence of a protein and its structure, cellular localization, interactions with functional partners, and contributions to the fitness of the organism. Fortunately, the ever-increasing flux of data, combined with the basic principles of protein biochemistry and evolutionary biology, is steadily improving our understanding of these dependencies and our ability to ask new questions that were previously impossible to pose. In this article, we shall briefly consider the founding principles required to tackle the issue of proceeding from protein sequence/structure to biological function.


2. Information in protein sequence: domain identification and function

Pioneering explorations of the protein universe by the early structural biologists resulted in the identification of an organizational level beyond the α-helix, the β-strands, and other basic elements of protein secondary structure. These higher-level structural features were characterized by the presence of independent folding units that were spatially well delineated from each other and often mapped to a collinear stretch of sequence in the polypeptide. These units, termed domains, proved to be the most relevant organizational aspect of a protein in terms of understanding its function. A striking example that illustrated the domain concept in the early days was provided by the structure of the immunoglobulins (Doolittle et al., 1966). These proteins had a number of structurally similar domains per polypeptide, each of which was largely composed of β-strands. Comparisons of these structurally similar immunoglobulin domains revealed a similarity even at the sequence level. This suggested their origin from an ancestral sequence unit that specified a single domain, through the process of duplication followed by sequence divergence. Identification of domains also provided the first clues regarding events beyond simple mutations in the evolution of proteins, such as recombination between unrelated genes. Thus, the genesis of large, complex proteins was made more tangible by explaining it in terms of duplications and shuffling of the smaller units, namely, the domains.

Furthermore, there also appeared to be a strong correspondence between these domains and specific biochemical activities or functions of proteins that contained them. Early examples of this principle were provided by the flavin or nicotinamide dinucleotide-binding Rossmann fold domain (Rao and Rossmann, 1973) and the DNA-binding Helix-turn-Helix domain (HTH) (Matthews et al., 1982; Sauer et al., 1982). These domains were respectively identified as the structural common denominator in otherwise unrelated proteins that bound either FAD/NAD or DNA. Experimental evidence from these proteins showed that their respective substrate binding abilities were essentially dependent on the Rossmann fold or the HTH domains (Figure 1). This fairly consistent, unitary relationship between the biochemical activities of proteins and their constituent domains suggested that the delineation of domains in a polypeptide could lead to the prediction of its biochemical and biological properties. Additionally, the presence of sequence conservation corresponding to the structural similarity of the domains implied that protein sequence information could be used as the basis for these predictions (Figure 1).

From the earliest days of protein sequence comparisons, it became clear that several proteins with shared biochemical properties, such as DNA-binding, or catalytic activities, such as ATP hydrolysis or proteolysis, often shared collinear sequence patterns or motifs that correlated with a given activity (Bork and Koonin, 1996). Experimental analysis of these protein sequence motifs (PSMs) through site-directed mutagenesis suggested that they often corresponded to the structural core of the fold, the active site, or the substrate interaction surfaces of a given protein (Bork and Koonin, 1996). Furthermore, superposition of these conserved motifs onto the three-dimensional structures of proteins, when available, showed that the motifs could also be used to define domains in proteins through sequence analysis. While these efforts strengthened the association of sequence motifs with protein domains and their biochemistry, they also emphasized the need for objective methodologies to weed out false relationships (spurious patterns) (Iyer et al., 2001).

Statistical models such as the extreme value distribution, originally developed for ungapped pairwise local alignments obtained in similarity searches between a query protein and a target database, provided the first objective means for evaluating the statistical significance of sequence alignments (Karlin and Altschul, 1990). Subsequently, more sensitive sequence profile methods for the detection of remote sequence similarity were developed on the basis of either position-specific score matrices (PSSM) (Altschul et al., 1997) or hidden Markov models (HMM) (Durbin et al., 1998). Simulations showed that the original extreme value model, developed for ungapped alignments, was also valid for gapped alignments and for those obtained in sequence profile searches (Altschul et al., 1997). These developments enabled the robust definition of PSMs and the detection of distant structural relationships and protein domains using the sequence information available in the protein databases (see Article 39, IMPALA/RPS-BLAST/PSI-BLAST in protein sequence analysis, Volume 7 for details). Thus, with the basic procedure for the objective identification of protein domains through sequence analysis in place, the path for the in silico journey from protein sequence to function was firmly established.
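As a rough illustration of the extreme value statistics mentioned above, the following sketch converts a raw alignment score S into a bit score, an expected number of chance hits (E-value), and a P-value, using the Karlin-Altschul relation E = K*m*n*exp(-lambda*S). The values of lambda and K used here are illustrative assumptions (of the order reported for ungapped BLOSUM62 scoring), and the function names are hypothetical; actual search programs estimate these parameters for each scoring system.

```python
import math

# Illustrative parameters only (assumed values, of the order reported for
# ungapped BLOSUM62 scoring); real tools estimate them per scoring system.
LAMBDA = 0.318
K = 0.134


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments with score >= S in an m x n search space."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # Hypothetical search: 250-residue query against 1,000,000 database residues.
    for raw in (40, 60, 80):
        e = e_value(raw, 250, 1_000_000)
        print(raw, round(bit_score(raw), 1), e, p_value(e))
```

The same E-value can be obtained from the bit score as m*n*2^(-S'), which is why bit scores from different scoring systems can be compared directly.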

Figure 1 Correspondence between shared conserved domains and common biochemical function. Various transcription factors that share the common biochemical property of DNA binding also share a Helix-turn-Helix domain (HTH). This indicates the correspondence between the HTH domain and the DNA-binding function. The other conserved domains fused to the HTH domain are cNMPBD: the cyclic nucleotide-binding domain; DSBH: the double-stranded beta-helix domain, a sugar-binding domain; PBPI: the periplasmic-binding protein I domain (also sugar binding). The grey boxes in the Antennapedia protein are the nonglobular regions.

The foundational principles for protein sequence analysis are deceptively straightforward, but their actual application is often bedeviled by numerous caveats and by overlying modifiers coming from other sources of information. We shall briefly consider some of these issues in the remainder of the article.

3. Sequence conservation and its functional significance

The mere detection of evolutionary (genuine) sequence similarity (termed homology) between two proteins does not mean that they are necessarily functionally equivalent. For a proper evaluation of the relevance of the homology between a set of proteins, a detailed understanding of the qualitative nature of the relationship and its significance is required. A set of homologous proteins or domains is best represented as a multiple sequence alignment of the various independent occurrences of the domain or protein. The degree of relationship, defined on the basis of an objective method of sequence similarity, is typically a continuum between the extremes for any comprehensive multiple alignment of an assemblage of homologous proteins. Some copies of the domain are very similar, others more divergent, and so on, until we reach the limit of objective detection of relationships. This most inclusive set of evolutionarily related domains is defined as a superfamily. Natural subdivisions within the superfamily, such as families and subfamilies, may also be definable on the basis of sequence similarity. Typically, all members of a superfamily might share a generic biochemical property, such as catalysis of a common class of reactions or binding to a particular class of substrates. For example, members of the GNAT acetyltransferase superfamily usually catalyze the transfer of the acetyl functional group to amino groups on various substrates. The identification of the GNAT acetyltransferase domain in a protein allows prediction of such a biochemical activity in that protein, but it does not automatically specify the substrate or the actual biological role played by the protein.

In many cases two or more superfamilies may show generic similarity in terms of the topology of secondary structure elements and features of their structural arrangements, despite the absence of any detectable sequence similarity between them. These superfamilies are then described as sharing a common fold (Andreeva et al., 2004). The identification of a shared fold in different superfamilies is usually not indicative of any specific shared biochemical properties, though they might show some general commonalities in terms of the spatial position of the interaction with substrates or location of active sites. Hence, the detection of a common fold in different superfamilies by itself is usually not a good predictor of similar biochemical aspects of their function.

An understanding of the levels at which natural selection sculpts a protein helps in gleaning more specific information from a set of homologous proteins. In most proteins from a given organism, natural selection acts in the purifying sense to maintain three different levels of information in the sequence:

1. The most stringent level is the structural level - selection weeds out any changes that could destabilize the basic folding pattern of the domain. Thus, the bulk of sequence conservation, particularly patches of hydrophobic residues, metal-chelating residues, and disulfide-bond-forming cysteines, corresponds to the residues that are likely to be critical for the folding and stability of a protein. These patterns are typically preserved throughout a given superfamily of proteins.

2. At the next level of stringency, natural selection safeguards the “active sites” of proteins. In the case of enzymes, these sites are the constellation of residues required for the catalytic activity and substrate interactions, while in the case of other proteins they are other key residues required for their characteristic biochemical activities. These are less stringently conserved than the structure-conferring features because two proteins descending from a common ancestor might adopt different biochemical properties as they diverge. For example, one of the descendant proteins might be selected for its enzymatic activity and retain the catalytic residues, while a protein belonging to a sister clade descending from the same ancestor might be selected for its ligand-binding properties and would not be under any pressure to retain the catalytic residues, though both proteins are selected to retain the same structure. The identification of such residues is most critical for making meaningful predictions regarding the biochemical functions that might be shared by a group of proteins (Aravind et al., 2002).

3. At the lowest level in the hierarchy, natural selection preserves the features of a protein that contain information relevant to its biological context. These might include the residues necessary to interact with other proteins in the pathways or complexes in which they function. Such features are not preserved when proteins that diverge from a common ancestor adapt to different functional milieus. However, identification and characterization of such residues help in narrowing down the actual biological role of a protein to a much greater degree.
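The hierarchy above can be explored in practice by asking which columns of a multiple sequence alignment are conserved across a superfamily and which only within a family. The sketch below is a minimal illustration, assuming plain Shannon entropy as the conservation measure and a hypothetical toy alignment; the thresholds and function names are assumptions, and low entropy by itself does not reveal whether a column is conserved for structural, catalytic, or contextual reasons.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def conserved_columns(alignment, max_entropy=0.5, max_gap_fraction=0.2):
    """Return indices of columns that are nearly invariant and rarely gapped."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for index, column in enumerate(zip(*alignment)):
        gap_fraction = column.count("-") / len(column)
        if gap_fraction <= max_gap_fraction and column_entropy(column) <= max_entropy:
            hits.append(index)
    return hits


if __name__ == "__main__":
    # Hypothetical toy alignment, not taken from any real protein family.
    toy = ["MKCLHG", "MRCIHG", "MKCVHA", "MQCLHG"]
    print(conserved_columns(toy))
```

Running the same function on the whole superfamily and then on a single family is one simple way to separate the broadly conserved, probably structural positions from those conserved only in a subset of the proteins.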

4. Low entropy and nonglobular structures in proteins

Beyond the well-defined domains that tend to form compact globular structures with regular secondary structure elements, there is an entire spectrum of low entropy structures ranging from superstructure-forming repeat units to random coils. These nonglobular structures, while not following the same structure-function relationships as the globular domains, might still provide useful biological information. Amongst the most common nonglobular structures are transmembrane (TM) helices and signal peptides, which are characterized by compositionally biased regions that are highly enriched in hydrophobic residues (see Article 37, Signal peptides and protein localization prediction, Volume 7 and Article 38, Transmembrane topology prediction, Volume 7). Another class of nonglobular proteins includes the fibrous proteins, like keratin, which form long alpha-helical structures. Such a helix dimerizes with another helix of the same kind to form a structure termed the coiled coil. Shorter coiled coils, also known as leucine zippers due to a periodic pattern of leucines, are found in a wide variety of proteins and serve as important determinants of protein dimerization (Lupas, 1996). Likewise, more complex repeats found in proteins, such as the tetratricopeptide repeat (TPR), typically organize into superhelical or propeller-like structures that serve as surfaces for interactions with other protein complexes (see Article 33, Protein repeats, Volume 7).
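Because TM helices and signal peptides are marked by a strong enrichment in hydrophobic residues, even a simple sliding-window hydropathy average can flag candidate segments. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6, settings that are commonly quoted for this scale; it is only an illustration with hypothetical function names and is no substitute for the dedicated predictors discussed in the articles cited above.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; non-standard residues score 0."""
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose mean hydropathy reaches the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


if __name__ == "__main__":
    # Hypothetical sequence: polar flanks around a 19-residue leucine stretch.
    toy = "DEKR" * 5 + "L" * 19 + "DEKR" * 5
    print(candidate_tm_windows(toy))
```

Overlapping hits would normally be merged into a single candidate segment before interpreting them.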

Yet another nonglobular segment observed in proteins is the unstructured or random-coil segment that generally does not assume a single stable configuration and occurs in solution in a highly mobile, disordered state. These regions of proteins correspond to sequences of low complexity: they are enriched in certain amino acids but lack others. In the extreme case, they might contain homopolymeric stretches of a single amino acid such as glutamine or proline, while in other cases they contain di- or tripeptide repeats (see Article 32, Sequence complexity of proteins and its significance in annotation, Volume 7). These nonglobular regions are particularly common in the eukaryotes, and often contain the sites for modification by a variety of enzymes such as kinases, glycosyltransferases, and NH2-group acetyltransferases, or binding sites for specific peptide-binding domains such as SH3 or WW domains (Zarrinpar et al., 2003). They might also contain signals for nuclear localization and other subcellular sorting signals. Given these associations, the identification of nonglobular segments often serves as a means to predict important aspects of a protein’s function, such as its subcellular localization and its potential to mediate protein-protein interactions. Thus, combining the information obtained from particular low entropy regions with the biochemical insights provided by the conserved globular domains often improves the precision of functional predictions.
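The notion of low complexity used above can be made concrete by measuring the Shannon entropy of the amino acid composition in a sliding window. The following sketch is a deliberately simplified stand-in for dedicated low-complexity filters and should not be read as an implementation of any of them; the window length, the entropy cutoff, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def composition_entropy(segment):
    """Shannon entropy (bits) of the amino acid composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def low_complexity_regions(seq, window=12, max_entropy=2.2):
    """Merge overlapping low-entropy windows into half-open (start, end) regions."""
    regions = []
    for start in range(len(seq) - window + 1):
        if composition_entropy(seq[start:start + window]) <= max_entropy:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or abuts the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical sequence: a polyglutamine stretch between two complex flanks.
    flank = "MKTAYIAKQRDEWLHGSNPVFC"
    print(low_complexity_regions(flank + "Q" * 20 + flank))
```

A homopolymeric window has an entropy of zero, whereas a window of twelve different residues reaches log2(12), or about 3.58 bits, so the cutoff sits between these extremes.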

5. The “society” of domains and the domain-Lego principle

The structural independence of domains allows different domains to associate with or dissociate from each other within a single polypeptide in the course of evolution. The majority of domains show a certain degree of “social behavior”, that is, co-occurrence with other domains. As a result, a significant fraction of the globular proteins encoded by an organism are collections of domains fused together. In most cases, these domains occur successively, next to each other, or are separated from each other by short or long nonglobular segments. In a smaller number of cases, a domain might occur inserted in a loop of another domain. The order and combination of domains found in a polypeptide is termed the domain architecture of the protein, and is generally depicted schematically by filled shapes standing for domains distributed along the length of the protein (Figures 1 and 2). Thus, the evolutionary process in which domains conglomerate in a protein can be likened to the construction of complex objects using simple Lego bricks (the interlocking toy bricks of Danish origin). Just as artists using a relatively limited repertoire of Lego bricks have been able to generate a remarkable diversity of forms, natural selection and recombination have fashioned an astonishing diversity of domain architectures.

Despite the variety of architectures, most of them can be grouped into a limited set of categories with specific teleological principles. The chief architectural categories include:

1. Solo domains: The occurrences of a domain all by itself in a polypeptide may or may not have a definite adaptive explanation (Figure 2). It can be safely assumed that this is the basal state of any domain in evolution, and any adaptively advantageous association with another domain is likely to be fixed. Thus, domains could stochastically flip in and out of neutral combinations with other domains, but any domain that has remained solo across evolutionarily distant organisms is likely to be under certain selective pressure to remain independent.

2. Architectures involving homodimeric or multimeric assemblies of the same domain: Individual domains that frequently function as dimers or multimers are selected to exist in domain architectures with two or more copies of the same domain in a polypeptide. The TATA-binding protein (TBP) that ancestrally bound asymmetrically exists in all its extant forms with a homodimeric architecture of two tandemly repeated TBP DNA-binding domains (Figure 2).

Figure 2 Categories of domain architecture. The domains depicted in the figure are P-loop NTPase: NTP-utilizing catalytic domain of the P-loop fold; TBP: the DNA-binding domain of the TATA-binding proteins; DHQ synthase: 3-dehydroquinate synthase; EPSP synthase: 5-enolpyruvylshikimate-3-phosphate synthase; ACT: a small molecule-binding domain; CBS: the so-called cystathionine beta synthase domains (also small molecule-binding domains); SH2: an adaptor domain that binds phosphotyrosine-containing peptides; SH3: a polyproline peptide-binding adaptor domain; PTPase: phosphotyrosine phosphatase domain; Y-kinase: tyrosine kinase domain.

3. Hetero-multidomain proteins: Proteins that are composed of domains that participate in either the same or successive steps in a pathway are frequently encountered in a variety of biological systems. The selective advantages of such architectures are clearly linked with the necessity for close functional interactions of the proteins in a pathway. Such fusions are particularly common in metabolic enzymes from eukaryotes, like the multifunctional aromatic amino acid biosynthesis protein (Aro1) that combines at least five distinct enzymatic domains into a single protein, including the shikimate kinase of the kinase family of P-loop ATPase and the dehydrogenase of the Rossmann fold (Figure 2).

4. “Regulatory architectures” involving small molecule binding domains (SMBDs): A variety of low molecular weight compounds regulate responses to environmental stimuli, reaction rates in basic cellular metabolism, nutrient transport, and signal transduction cascades in biological systems. Some of the best-studied forms of small molecule-dependent regulation include allosteric control of enzyme activity, feedback regulation of biochemical pathways, and catabolite repression. All these processes have selected for a variety of “regulatory architectures” that typically combine an SMBD with an effector domain (Figure 2). These latter domains are the actual agents of a certain biological activity, which is altered by the binding of a small molecule to the associated SMBD. Thus, we have SMBDs such as the ACT domain, which binds amino acids and other related small molecules, fused to a variety of catalytic domains in proteins such as homoserine dehydrogenase and aspartokinase. In these cases, the ACT domain binds substrates or downstream products and mediates the allosteric or feedback effect on the catalytic domains (Anantharaman et al., 2001).

5. “Adaptor architectures” or fusions of adaptor domains with effector domains: Very often, the performance of a certain biological function involves the targeting of a specific activity that resides in the effector domains to a certain cellular context. This context may be a subcellular location, or a precise protein complex or other biopolymers such as DNA or carbohydrates. The domains that recognize these contexts by binding specific target molecules pertaining to these contexts are generally termed adaptor domains. The most common of these architectures involve fusions of specific peptide-binding or protein-interacting domains with catalytic domains. These are abundantly represented by the fusions between adaptor domains such as the SH2, SH3, WW, and BRCT domains and catalytic domains such as protein kinase, GTPase activating protein, phospholipase, acetyltransferase, and ubiquitination-related domains (see Article 31, Protein domains in eukaryotic signal transduction systems, Volume 7). Yet other proteins are made up entirely of such adaptor domains, and act as “multiheaded” adaptors that bring together distinct protein complexes (Figure 2). Homophilic interactions are another common mode of action by which two proteins bearing the same adaptor domains, such as the DEATH superfamily of domains, associate with each other in a signaling pathway.

The above classes of domain architectures that are repeatedly reinvented in entirely unrelated proteins might be considered convergently favored solutions for problems in biological engineering in a domain-based world. Given that the above-discussed architectural classes are associated with specific teleological explanations, they represent a subtle source of contextual information that allows refinement of functional prediction of proteins when combined with more straightforward protocols of domain identification and nonglobular segment analysis (Figure 1).
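To show how the architectural categories listed above could feed into an automated annotation step, the sketch below treats a domain architecture as an ordered list of domain names and assigns it to one of the five categories. It is a crude illustration under stated assumptions: the sets of SMBDs and adaptor domains contain only names mentioned in this article, the classification rules and their order of precedence are simplifications of our own, and the function name is hypothetical.

```python
# Domain names taken from the text; a real analysis would use a curated list.
SMBDS = {"ACT", "CBS"}
ADAPTORS = {"SH2", "SH3", "WW", "BRCT"}


def classify_architecture(domains):
    """Assign an ordered list of domain names to a broad architectural category."""
    if not domains:
        return "no globular domain detected"
    unique = set(domains)
    if len(domains) == 1:
        return "solo"
    if len(unique) == 1:
        return "homo-multimeric"
    effectors = unique - SMBDS - ADAPTORS
    if unique & SMBDS and effectors:
        return "regulatory"
    if unique & ADAPTORS:
        # Covers adaptor-effector fusions and purely "multiheaded" adaptors.
        return "adaptor"
    return "hetero-multidomain"


if __name__ == "__main__":
    examples = {
        "TBP-like": ["TBP", "TBP"],
        "aspartokinase-like": ["Aspartokinase", "ACT", "ACT"],
        "phosphatase-like": ["SH2", "SH2", "PTPase"],
        "Aro1-like": ["DHQ synthase", "EPSP synthase", "Shikimate kinase"],
    }
    for name, architecture in examples.items():
        print(name, "->", classify_architecture(architecture))
```

In this simplified scheme the presence of a single adaptor or small molecule-binding domain decides the category, which is why the order of the checks matters and would need tuning for real data.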

6. Apprehending the “meta-picture”: extraneous sources of contextual information in establishing protein function

Most biological systems can be conceptualized as graphs (networks) whose nodes represent individual proteins and whose edges represent the functional connectives that radiate out from a protein at a given node to its partners occupying adjacent nodes (see Article 30, Contextual inference of protein function, Volume 7). These connectives encompass a considerably diverse set of functional interactions:

1. Physical interactions: This includes a variety of direct interactions such as the tendency of two or more proteins to form a functional complex or enzyme-substrate interactions between two proteins.

2. Indirect regulatory interactions: This includes transcriptional control of the gene encoding a given protein by another protein, which is a transcription factor, or indirect regulation of a target protein through the synthesis or delivery of a small molecule messenger.

3. Coexpression and colocalization: Coexpression is specified by congruent expression patterns of two or more genes whose products are likely to participate in a similar biological process. Colocalization implies the similar subcellular targeting of two proteins, which in turn points to their potential functional interaction in a common organelle or cellular compartment. These functional connections are not mutually exclusive of the earlier-discussed connections.

4. Genetic interactions: These connections might arise because of any of the above types of interactions or the general participation of two gene products in a common pathway. Genetic interactions can be ultimately decomposed into a specific type of underlying biochemical interaction, but even in the absence of specific biochemical explanations, genetic interactions can provide useful contextual information for functional inference.

These functional interactions that anchor a protein to its place in a biological network can often provide contextual information that goes above and beyond the information provided by intrinsic features such as domain architectures. This extraneous contextual information is usually critical in translating the biochemical inferences gleaned from sequence analysis into the actual biological roles of proteins.

Prior to the postgenomic era, the establishment of these functional connections was the exclusive realm of focused biochemical and genetic experimentation. However, these days, a large amount of contextual information is directly available and can be used for different kinds of in silico analysis to determine biological functions of uncharacterized proteins through the principle of “guilt by association” (see Article 30, Contextual inference of protein function, Volume 7). The basic idea here is to establish a link between an uncharacterized protein and one or more functionally characterized proteins by means of one or more of the above functional connections. This can then be used to implicate a protein in a particular functional pathway or biological process (Figure 3).
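The “guilt by association” principle can be sketched as a simple vote among the annotated neighbors of an uncharacterized protein in the interaction graph. The example below is a minimal illustration with a hypothetical toy network and hypothetical protein names; the support threshold and the function name are assumptions, and real analyses would weight the different types of functional connection rather than count them equally.

```python
from collections import Counter


def guilt_by_association(network, annotations, protein, min_support=2):
    """Rank biological processes by how many annotated neighbors carry them."""
    votes = Counter()
    for neighbor in network.get(protein, ()):
        for process in annotations.get(neighbor, ()):
            votes[process] += 1
    return [(process, n) for process, n in votes.most_common() if n >= min_support]


if __name__ == "__main__":
    # Hypothetical network: edges may stand for any of the connection types above.
    network = {
        "unknown1": {"protA", "protB", "protC", "protD"},
        "protA": {"unknown1", "protB"},
    }
    annotations = {
        "protA": {"DNA repair"},
        "protB": {"DNA repair", "recombination"},
        "protC": {"DNA repair"},
        "protD": {"translation"},
    }
    print(guilt_by_association(network, annotations, "unknown1"))
```

Processes supported by a single neighbor are dropped by default, a modest guard against implicating a protein in a pathway on the strength of one possibly spurious link.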

The simplest forms of contextual information arise directly from genome sequence data and are usually in the form of phyletic profiles of orthologous proteins, conserved gene neighborhoods, and lineage-specific expansions of protein families. Yet other forms of contextual information have been obtained from a whole range of high-throughput experimental studies on model organisms such as Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster.
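Phyletic profiles lend themselves to a particularly simple comparison: each protein family is represented by the set of genomes in which an ortholog is present, and families with closely matching sets are candidates for functional linkage. The sketch below uses the Jaccard index as the similarity measure, which is one reasonable choice among several; the family names, genome labels, and cutoff are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets of genomes (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_profiles(profiles, query, min_similarity=0.75):
    """Families whose presence/absence pattern resembles that of the query family."""
    target = profiles[query]
    hits = [
        (name, jaccard(target, genomes))
        for name, genomes in profiles.items()
        if name != query
    ]
    return sorted(
        (hit for hit in hits if hit[1] >= min_similarity),
        key=lambda hit: -hit[1],
    )


if __name__ == "__main__":
    # Hypothetical phyletic profiles: family name -> genomes containing an ortholog.
    profiles = {
        "uncharacterized": {"g1", "g2", "g3", "g5"},
        "familyA": {"g1", "g2", "g3", "g5"},
        "housekeeping": {"g1", "g2", "g3", "g4", "g5", "g6"},
        "familyB": {"g4", "g6"},
    }
    print(similar_profiles(profiles, "uncharacterized"))
```

Universally present families match almost every profile to some extent, so they are usually less informative than families with a patchy distribution.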

Figure 3 A general scheme for in silico functional inference for uncharacterized proteins. The grey box denotes the sequence of an uncharacterized protein prior to domain identification. The yellow boxes with T represent TM segments, while “Dom1” and “Dom2” represent two conserved globular domains that were detected in the protein.

7. Overview of the section on protein sequence analysis and annotation

The confluence of various streams of data holds considerable promise in developing a unified basis for understanding biology in terms of the constituent modules of proteins and their functions (Figure 3). Accordingly, we have attempted to assemble a collection of articles in this section that provide primers regarding the various computational approaches that can be used to glean functional information from proteins. Some of the articles specifically explore the means of investigating nonglobular regions of proteins such as TM regions, signal peptides, and low complexity regions in making functional inferences and predicting protein localization (see Article 32, Sequence complexity of proteins and its significance in annotation, Volume 7, Article 33, Protein repeats, Volume 7, Article 37, Signal peptides and protein localization prediction, Volume 7, and Article 38, Transmembrane topology prediction, Volume 7). Other articles discuss the significance of conserved domains that are prevalent in specific cellular processes and systems such as signal transduction and chromatin organization (see Article 31, Protein domains in eukaryotic signal transduction systems, Volume 7 and Article 35, Measuring evolutionary constraints as protein properties reflecting underlying mechanisms, Volume 7). There is also a discussion of the application of modern sequence profile methods for the detection of distant relationships and domain discovery, and the role of curated protein classification databases in large-scale annotation of protein functions (see Article 36, Large-scale, classification-driven, rule-based functional annotation of proteins, Volume 7 and Article 39, IMPALA/RPS-BLAST/PSI-BLAST in protein sequence analysis, Volume 7). Finally, there is a perspective on the integration of various forms of contextual information to grasp function at the organismic level (see Article 30, Contextual inference of protein function, Volume 7).
