Sequence complexity of proteins and its significance in annotation (Bioinformatics)

1. Introduction

The concept of complexity of protein sequences originated from the consideration of sequences as strings of symbols that can be studied with linguistic methods. Simple descriptors such as amino acid compositional properties attracted early attention. These studies suggested that small globular proteins can be classified according to amino acid composition and that this classification corresponds roughly to secondary structural class, cellular localization, and enzyme function (Nishikawa and Ooi, 1982; Nishikawa et al., 1983a; Nishikawa et al., 1983b). Nevertheless, the predictive value of this correlation is limited, as later reanalyses with more abundant structural data have shown (Clementi et al., 1997; Eisenhaber et al., 1996b; Eisenhaber et al., 1996a).

With increased efficiency of sequencing in the 1980s and 1990s, sequences of multidomain proteins became readily available. Some of their sequence regions differed markedly from previously known globular proteins by an obvious abundance of a single amino acid type or a combination of only very few types; therefore, these sequence segments were considered “unusual” or “extraordinary”. From the sequence point of view, these compositionally biased regions can be homopolymeric runs, irregular mosaics of a few residue types, or short-period (almost) regular repeats. Genome sequencing projects have shown that such proteins occur frequently, especially in eukaryote proteomes. The fly protein “brakeless” (acc. AAF76322, 2302 AA; its loss of function is associated with unlimited growth of optical axons) is an extreme example. The N- and C-terminal halves (each of almost 1000 residues) are serine- and glutamine-rich, respectively, and only three tiny, scattered islands in the middle of the sequence are apparently of globular structure (Senti et al., 2000).


The discovery that amino acid compositional bias over domain-size segments is associated with the property of having a globular or nonglobular structure has fuelled the development of criteria for sequence complexity and methods to delineate simple, potentially nonglobular sequence segments as low complexity regions.

2. Criteria for measuring sequence complexity and methods for delineating compositionally biased regions

Whereas sequences from globular domains appear visually as “good” mixtures of residue types, the biased sequence segments are simple because of (1) their nonrandom, highly biased amino acid composition and (2) their tendency to repeat a small sequence motif monotonously. Various methods address these two points, but indirect approaches are also possible. Sequence simplicity is often but not always associated with conformational flexibility (invisibility in X-ray structures, or visibility only with high B-factors), absence of regular secondary structure (coil status), and high solvent accessibility over long stretches; thus, tools that search for these properties might also be useful to segment a query protein.

The compositional aspect can be evaluated with compositional complexity measures calculated over sequence windows. Given an amino acid composition $\{n_i\}_{i=1,\dots,20}$ and a sequence window of length $L = \sum_i n_i$, there are $N = L!/\prod_i n_i!$ possible sequences. In the SEG algorithm (Wootton, 1994a; Wootton and Federhen, 1996), subthreshold segments for the Shannon entropy-like term

$$K_2 = -\sum_{i=1}^{20} \frac{n_i}{L} \log_2 \frac{n_i}{L}$$

are searched for and merged into a raw region first, after which the subsegment of the raw region with lowest probability $K_1 = (\log N)/L$ is found. Depending on the window length and thresholds, the sensitivity of SEG can be increased and it can find even regions with subtle compositional deviations such as coiled coil regions.
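The two window measures can be sketched in a few lines of Python. This is only an illustration of the formulas, not the SEG program itself: the function names are hypothetical, the logarithm base for K1 is assumed to be natural, and the window length of 12 with an entropy threshold of 2.2 bits is used as an illustrative parametrization. The merging of subthreshold windows into raw regions and the subsequent refinement step are not reproduced.

```python
import math
from collections import Counter


def k2_entropy(window: str) -> float:
    """Shannon entropy-like term K2 of one window, in bits per residue."""
    length = len(window)
    counts = Counter(window)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def k1_complexity(window: str) -> float:
    """K1 = log(N) / L with N = L! / prod(n_i!).

    The natural logarithm is an assumption; lgamma avoids huge factorials.
    """
    length = len(window)
    counts = Counter(window)
    log_n = math.lgamma(length + 1) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )
    return log_n / length


def low_entropy_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Start positions (0-based) of windows whose K2 is at or below threshold."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if k2_entropy(seq[i:i + window]) <= threshold
    ]
```

A homopolymeric window gives K2 = 0 and K1 = 0, whereas a window in which all residue types are equally frequent maximizes both; a polyglutamine run followed by a mixed stretch is therefore flagged only as long as the window still overlaps the run substantially.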

Other methods such as CAST (Promponas et al., 2000) or P-SIMPLE (Alba et al., 2002) search for simple segments with bias toward a single amino acid type or a small motif (up to four residues), the first one by assessing alignments with homopolymeric runs and the second by counting motif occurrences in sequence windows. The POPP approach (Wise, 2002) focuses on the frequency of specific mono-, di-, and tripeptides in whole proteins or specified segments and compares them with distributions in sequence databases.
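The idea of counting motif occurrences in sequence windows can be illustrated with a minimal sketch. It is not the published P-SIMPLE scoring scheme: the function names, the window length, and the count cutoff are hypothetical choices made for this example, and no statistical assessment against a background is attempted.

```python
def count_motif(window: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif in window."""
    return sum(
        window.startswith(motif, i)
        for i in range(len(window) - len(motif) + 1)
    )


def motif_rich_windows(seq: str, motif: str, window: int = 10, min_count: int = 3):
    """Start positions (0-based) of windows with at least min_count motif hits."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if count_motif(seq[i:i + window], motif) >= min_count
    ]
```

For instance, `motif_rich_windows("MKLVQAQAQAQAGHIT", "QA", window=8, min_count=4)` reports only the window that starts at position 4, which is the one that spans the entire QA repeat.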

The GlobPlot tool (Linding et al., 2003b) is a propensity-based predictor and the DisEMBL package (Linding et al., 2003a) is a suite of neural networks. Both are parametrized from coil segments in known three-dimensional (3D) structures and can serve for finding extended unordered regions.

Each of the above techniques misses a notable share of nonglobular regions (Kreil and Ouzounis, 2004). With parametrizations for high sensitivity, the false-positive prediction rate is considerable. Thus, uncurated computer-generated annotations of protein sequence databases with these methods are not advisable.

3. Occurrence of compositionally biased regions in protein sequences, their classification, and their role in human disease

It was observed early on that sequence segments with amino acid compositional bias can occur in natural proteins but, originally, they were considered exceptional and rare. This view originated from the restricted availability of sequences, mostly of short, single-, or two-domain proteins with well-characterized 3D structure (typically, with metabolic function and of prokaryote origin). Indeed, globular domains with known 3D structure contain a well-balanced composition of various hydrophobic and different polar residues with functional groups. Otherwise, folding with tight packing in the core and a polar, solvatable surface would not have been possible. Not surprisingly, compositionally biased segments are rare in globular domains and comprise only about 0.5% of the sequence (Wootton, 1994b). They are usually short (<30 AA, ~15 residues on average). Most of the shorter ones appear as moderately polar stretches as part of long solvent-accessible loops or at the polypeptide termini, whereas the longer segments typically represent hydrophobic or amphipathic helices and are involved in the structural packing (Saqi, 1995).

With more efficient sequencing technologies and especially with the genome sequencing projects, the available subset of protein sequences has become more representative. Usually, proteins consist of functionally different sequence regions. Some of them may represent globular domains but the others have compositional bias and, typically, have a nonglobular nature. The amount of low complexity regions in large protein databases has been evaluated with the SEG program and was found to comprise about a quarter of all residues in known proteins. Interestingly, the fraction of low complexity sequence is higher in eukaryotes (an estimated one-third of the total proteome in Drosophila melanogaster) than in prokaryotes (~10%) (Wootton, 1994b; Wootton and Federhen, 1996; Huynen et al., 1998).

The status of structural and functional characterization of low complexity regions is highly diverse. Some examples are listed in Wootton (1994b). Reasonably long spans (often rich in proline and small polar residues) between globular domains apparently serve just as conformationally flexible linkers. Low complexity segments with bias toward hydrophobic residues or with a repetitive hydrophobic pattern are also comparatively well understood. They are typically found in membrane-attachment regions, are buried in protein complexes, and/or form fibrillar structures.

In contrast, the structure-forming potential and the molecular function of long low complexity regions with primarily polar residues remain insufficiently understood. Apparently, many of them carry sites for posttranslational modifications and interactions with other biomacromolecules and are important for cellular regulation and developmental processes (Wootton, 1994b). For example, mutational expansions of CAG triplets coding for polyglutamine regions in human genes beyond a certain length cause neurodegenerative disorders such as Huntington’s disease, spinocerebellar ataxia 6, or Kennedy disease (Perutz, 2004). Surprisingly, polyasparagine repeats are suppressed in mammalian proteins (Kreil and Kreil, 2000). Many intrinsically unstructured proteins such as titin, prion, tau, and others appear to evolve via repeat expansion (Tompa, 2003).

Polar low complexity regions are especially abundant in the known part of the proteome from Plasmodium falciparum, Plasmodium berghei, and Dictyostelium discoideum – even relative to the background of other eukarya (Pizzi and Frontali, 2001). Possibly, variable immuno-dominant epitope loops of transmembrane proteins and the variety of glycosylation variants in polyasparagine tracts have clinical relevance since they are part of Plasmodium’s strategy to evade the immune response of the host (Duffy et al., 2003; Newbold, 1999).

The strict criteria for selecting low complexity segments in standard tools overlook many instances of regions with milder compositional bias that are nevertheless not globular either. The sequence environments of many posttranslational modification (PTM) sites and of subcellular translocation signals are examples of the more general theme of interaction of a globular protein with an extended and conformationally malleable region of another one (Eisenhaber et al., 2004; Iakoucheva et al., 2004). In the case of a PTM, only a small region of the binding site of the substrate protein directly interacts with the active site cavity of the modifying enzyme. To allow this interaction physically, the site of the PTM has to be surrounded by a sufficiently solvent-interacting sequence region without inherent conformational preferences of its own that links the site of the PTM with the remainder of the substrate protein (Eisenhaber et al., 2004).

4. The homology concept and the role of compositionally biased regions in sequence similarity searches

The most successful and widely known approach for inferring protein function with nonexperimental means, the annotation transfer from homologs, is based on the observation of similarity between protein sequences having similar 3D structure and molecular function. In evolutionary terms, even distantly related sequences are considered to originate from a common ancestor sequence in a mutational process.

The distance metrics used in sequence similarity searches, for example, the E-values in the popular BLAST/PSI-BLAST suites (Altschul et al., 1994; Altschul et al., 1997) that evaluate the probability of having a common ancestor, are correctly applicable only for globular sequence segments. The amino acid type substitution matrices such as BLOSUM62 (Henikoff and Henikoff, 2000) are derived from datasets of point mutations in secondary structural elements in globular domains of well-studied protein families and are, therefore, designed to detect subtle sequence similarities between distantly related sequence instances of the same family. Not surprisingly, false-positive hits in database searches are typically caused by compositionally biased regions, for example, cysteine- or proline-rich regions, long hydrophobic stretches of transmembrane helices/signal peptides or monotonous hydrophobic patterns of fibrillar structures.

Hence, it is necessary to remove compositionally biased regions from query sequences before sequence similarity searches. As a standard option, BLAST offers the most stringent standard parametrization of the SEG program for masking obviously low-complexity segments. Especially in the case of frequently interspersed homopolymeric runs, CAST is a good alternative (Kreil and Ouzounis, 2004). Often, both approaches are insufficient to remove the cause of false hits, and it becomes necessary to carefully determine membrane-embedded regions, fibrillar segments, peptides that are responsible for targeting to subcellular localizations (for example, signal peptides), regions responsible for posttranslational modifications (Eisenhaber et al., 2003; Nielsen et al., 1999), and other types of nonglobular segments in the query prior to the similarity search.
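Once nonglobular segments have been delineated by whatever means, removing them from the query amounts to masking the corresponding residues. The following sketch assumes that the regions are supplied as 0-based, end-exclusive intervals and that masked residues are replaced by the letter X; the function name and the interval convention are choices made for this example and do not reproduce the masking built into BLAST.

```python
def mask_regions(seq: str, regions, mask_char: str = "X") -> str:
    """Replace residues inside the given intervals with mask_char.

    regions: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumed convention); intervals are clipped to the sequence.
    """
    residues = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)
```

For example, `mask_regions("MKQQQQQQLV", [(2, 8)])` yields `"MKXXXXXXLV"`. Intervals from several sources (low complexity, transmembrane helices, signal peptides) can be passed together, since overlapping intervals simply mask the same positions more than once.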
