Cuticular Proteins (Insect Molecular Biology) Part 3

CPR family:

Proteins with the R&R consensus By far the most common family of cuticular proteins is that containing the R&R Consensus. The name comes from a 28-aa motif, first recognized by Rebers and Riddiford (1988) in six cuticular proteins. The original R&R Consensus is part of a longer conserved sequence, pfam00379. A valuable website, Pfam (http://pfam.janelia.org/family/), has used hidden Markov modeling to define motifs characteristic of particular classes of proteins (Bateman et al., 2002). In accordance with recent nomenclature, this extended consensus region of about 63 amino acids will be referred to hereafter as the R&R Consensus. When a protein sequence is searched against non-redundant protein sequences using blastp at the BLAST server (http://www.ncbi.nlm.nih.gov/BLAST/), the first information that is presented is an indication of matches to pfam entries. The pfam sequence that allows annotators to classify a protein as a cuticular protein in the CPR family is pfam00379, a 68-aa sequence that includes the extended R&R Consensus. It also goes under the name "chitin_bind_4," for reasons that will become apparent in section 5.5.4. Pfam00379 was obviously based on proteins of both RR-1 and RR-2 classes, for it matches neither particularly well. This makes it particularly useful for a preliminary classification of a putative cuticular protein sequence.

An indication of the importance of the R&R Consensus comes from the Pfam website. It reports 2456 distinct proteins with the Consensus from 67 different species of arthropods (http://pfam.janelia.org/family?acc=PF00379#tabview=tab6). This is an underestimate, because close to 100 sequences from Hymenoptera are absent. The CPR family is restricted to arthropods. The one apparent exception, Xenopus NP_001090156.1, is due to a contaminating sequence from Drosophila erecta (Willis, 2010).


While 98% of the entries have only a single occurrence of the R&R Consensus, the exceptions are interesting. The most notable exception is a protein from the tailfin of the prawn Penaeus japonicus. The entire sequence of this protein is made up of 14 consecutive pfam00379 motifs (Ikeya et al., 2001). A protein from the horseshoe crab Tachypleus tridentatus (BAE44187.1) has five Consensus regions (Iijima et al., 2005), and the current annotation of the Ixodes scapularis genome reports several instances in a single predicted protein. Manual annotation revealed that most insect genes predicted to code for a protein with more than a single Consensus region actually coded for multiple proteins, easily recognized by standard markers of gene and transcript boundaries. There remains a small number of insect proteins that genuinely appear to have two Consensus regions, and the only one with three has orthologs in several species (Willis, 2010). When only a single Consensus region is present, it can be found near the N- or C-terminus, or within the protein. Three distinct forms of the Consensus have been recognized and named by Andersen (1998, 2000): RR-1, RR-2, and RR-3. RR-1-bearing proteins have been isolated from flexible cuticles, while RR-2 proteins have been associated with hard cuticle. This generalization was based on relatively few cases, and it has also been suggested that RR-2 proteins will contribute to exocuticle while RR-1 will be found predominantly in endocuticle (Andersen, 2000). This issue has not been resolved, even with the extensive expression data that are now available (Togawa et al., 2007). Hopefully, immunolocalization data (see section 5.2.1.3) will prove helpful. The RR-2 Consensus region is far more conserved in length and sequence than the one from RR-1 proteins, as can be seen in the WebLogos in Figure 1. The website CuticleDB provides a tool using Hidden Markov Modeling to learn if a protein is RR-1 or RR-2 (Karouzou et al., 2007).

Within the CPR family, numerous proteins can be identified that have orthologs in several species, some with distinct Consensus regions and other features (Cornman and Willis, 2008; Zhang and Pelletier, 2010).

The wealth of information on cuticular protein sequences and the unraveling of how the structure of some contributes to the interaction of chitin and protein (see section 5.5) is only a beginning. Essential properties of cuticle remain to be explained, and important questions raised in the older literature about various means of achieving cuticle plasticity and the importance of hydration in cuticle stabilization must not be forgotten (Vincent, 2002, and references therein).

An especially interesting member of the CPR family is the resilin gene. The name "resilin" has been given to the rubber-like proteins responsible for the elasticity of jumping fleas and vibrating wings. Analysis of resilin-bearing cuticles in froghoppers (Aphrophora alni and Philaenus spumarius) concludes that resilin can function in two quite different ways. It is used: as an energy buffer in rhythmically active, fast mechanical movements, such as those of the wings during flight or the tymbals in cicadas . . . The almost perfect elastic recovery of resilin and its extreme resistance to mechanical fatigue mean that it can return nearly all of the power put into it for the next cycle of movement. The second role . . . is in providing a flexible material that is combined with the stiffer chitinous cuticle in a composite structure.

The first identification of a complete sequence for resilin was carried out by Ardell and Andersen (2001), who used peptides from locust (Schistocerca gregaria) and cockroach (Periplaneta americana) resilin to identify a likely homolog in D. melanogaster. The peptides came from the R&R Consensus region. The protein they identified was CG15920. Its 18 N-terminal copies of a 15-residue repeat and 13 C-terminal copies of a 13-residue repeat were predicted to contribute to a beta-spiral, a common form for proteins with elastic properties (Ardell and Andersen, 2001). The corresponding gene produces two transcripts; one lacks over two-thirds of the start of the Consensus region. Two groups have studied the physical properties of CG15920 and its repeat regions, and showed that they have the properties one would expect of highly elastic proteins (Elvin et al., 2005; Qin et al., 2009).

The identification of resilin in other species is complicated. Two recent analyses, one brief (Willis, 2010) and the other detailed (Andersen, 2010b), emphasize the difficulties and reach different conclusions about some possible homologs. Andersen emphasizes the need for repeat regions that would underlie the elastic properties, while Willis focuses on the R&R Consensus region that showed such conservation between Schistocerca, Periplaneta, and D. melanogaster. Both conclude that an authentic resilin gene should code for the Consensus, although alternative splicing may eliminate it in some of its transcripts. A major complication is that Lyons et al. (2007) found highly elastic physical properties of an An. gambiae protein coded for by an EST BX61916.1, but the corresponding gene, AGAP002367, lacks the Consensus region and is most closely related to a D. melanogaster protein (CG7709) that has been characterized as a mucin.

Figure 1 Comparison of the highly conserved RR-2 Consensus with the more variable one from RR-1 proteins. WebLogos were constructed at http://www.weblogo.berkeley.edu/logo.cgi (Schneider and Stephens, 1990; Crooks et al., 2004). (A) RR-1 Consensus regions from 52 sequences from B. mori. (B) RR-1 Consensus regions from 51 sequences from An. gambiae. (C) WebLogo constructed from 87 B. mori and 101 An. gambiae sequences.
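To make concrete what "more conserved" means in a sequence logo, the following is a minimal sketch of the per-column information content that logo stack heights represent (Schneider and Stephens, 1990). It assumes pre-aligned sequences of equal length and omits the small-sample correction that the WebLogo server applies; the function names and the toy alignment are hypothetical, not taken from the chapter.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column.

    Uncorrected form: log2(alphabet size) minus the Shannon entropy of
    the observed residue frequencies. Gap characters ('-') are ignored.
    """
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(aligned: list[str]) -> list[float]:
    """Information content for every column of an equal-length alignment."""
    return [column_information("".join(col)) for col in zip(*aligned)]


# Toy alignment (made-up residues, not real Consensus sequences).
toy_alignment = ["GYSY", "GFSF", "GYTY", "GWSA"]
for position, bits in enumerate(alignment_information(toy_alignment), start=1):
    print(position, round(bits, 2))
```

An invariant column reaches the maximum of log2(20), about 4.32 bits, whereas a column with many different residues approaches zero; a highly conserved region such as the RR-2 Consensus would therefore show tall stacks at most positions.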

The Anopheles protein with the closest similarity (74%) to the Consensus region of Dmelresilin is AgamCPR152, but this protein lacks any repeats. Andersen identified AgamCPR140, which has many repeats, as a possible resilin, but the Consensus region is only 34% similar to Dmelresilin.

One consistent property of resilin is its ability to fluoresce due to di- and tri-tyrosine cross-links (Andersen and Weis-Fogh, 1964), so a combination of studies that establish anatomical location of candidate proteins along with the physical properties should make it possible to sort out what sequences are truly resilin, and identify possible differences when it is serving its different roles.

CPF and CPFL families A motif corresponding to a 51-aa repeat first recognized by Andersen et al. (1997) has been identified in a modified form in at least 9 orders of insects. However, the common repeat is somewhat shorter, at 42-44 amino acids, so the original name for the CPF family has been retained, with the F now referring to forty rather than to fifty. A detailed discussion of this family can be found in Togawa et al. (2007), where it is pointed out that in addition to the conserved motif of about 42 amino acids, the proteins are also similar in the amino acids near their carboxyl-termini.

The C-terminal region characteristic of the CPF family has also been found in other cuticular proteins that lack the defining consensus. Togawa et al. (2007) named these CPFL, for CPF-like. All four CPF proteins and six of the seven CPFL proteins in An. gambiae have been verified as authentic cuticular proteins, based on shared peptides identified in a tandem mass spectrometry analysis of cast cuticles (He et al., 2007).

CPF and/or CPFL proteins have been identified throughout the Hexapoda, including Collembola and Diplura (Table 4), but not yet in Crustacea or Chelicerata.

TWDL family One of the families of low-complexity proteins previously had been identified in D. melanogaster and named TWDL after the tubby phenotype in one of its mutants that reminded the authors of Tweedle Dee (Guan et al., 2006). There are 27 members of this family in D. melanogaster, 12 in An. gambiae, and fewer in other insects (Table 3). Their relationships are discussed in detail in Cornman and Willis (2009). Four conserved regions were defined by Guan and colleagues, and they remain diagnostic of the proteins across the Insecta (Figure 2). One member of the TWDL family (BmorCPT1) has been identified in a proteomics analysis of larval chitin-binding proteins, and a recombinant version binds chitin in an in vitro assay (Tang et al., 2010).

CPLCG family The largest of the new CP families is CPLCG, recognized by a conserved G-x(2)-H-x(2)-P motif (Cornman and Willis, 2009). The x residues are restricted to just a few amino acids, and a sequence logo, encompassing a longer stretch of conserved amino acids, is shown in Figure 2. Two members of this family had been reported in D. melanogaster (Qiu and Hardin, 1995); three are now recognized, along with 27 in An. gambiae. The D. melanogaster sequences had been named Dacp-1 and -2, but members of the family are not restricted to adults, and the CPLCG name is more accurate. Furthermore, the family is not restricted to the Diptera, but was identified in other orders of insects and the crustacean Daphnia (Tables 3, 4).
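As an illustration of how a PROSITE-style pattern such as G-x(2)-H-x(2)-P can be used to screen sequences, here is a minimal sketch that translates it into a regular expression. It treats x as any residue, because the restricted set of residues allowed at those positions is not listed here; the function name and the toy sequence are hypothetical, and a match is only a first hint, not evidence of family membership.

```python
import re

# G-x(2)-H-x(2)-P as a regular expression. The lookahead wrapper
# lets overlapping occurrences be reported as well.
CPLCG_CORE = re.compile(r"(?=(G.{2}H.{2}P))")


def find_cplcg_core(seq: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for each hit."""
    return [(m.start() + 1, m.group(1)) for m in CPLCG_CORE.finditer(seq.upper())]


# Made-up toy sequence, not a real CPLCG protein.
toy = "MKAAGAPHGYPVVA"
print(find_cplcg_core(toy))
```

A stricter screen would replace each "." with a character class once the permitted residues are known, or use a profile built from the aligned family members.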

CPLCW family Another small family, CPLCW, appears to be restricted to mosquitoes (Table 4). The WebLogo (Figure 2) shows the invariant W after which it was named, but several other amino acids in a 29-aa region are also almost invariant. Its nine genes are clustered in An. gambiae interspersed among some members of the CPLCG family, but the protein sequences of CPLCG and CPLCW families are distinct, having an average similarity of only 20% (Cornman and Willis, 2009).

CPLCA family The CPLCA family has from 13% to 26% alanine residues, but this number is not higher than in some members of other families; rather, the family is best identified by the presence of the retinin domain (pfam04527), although the D. melanogaster protein retinin is an outlier in the phylogeny of the group (Cornman and Willis, 2009). A WebLogo more typical of the group has been created (Figure 3). While the first published account of this family (Cornman and Willis, 2009) stated that it is restricted to Diptera, there is clearly an EST in Daphnia that has a sequence corresponding to the WebLogo (FE341353.1).

CPLCP family This is the most problematic of the cuticular protein families. Peptides corresponding to four genes turned up in the proteomics analysis of proteins from larval head capsules and cast pupal cuticles of An. gambiae (He et al., 2007). An additional 23 genes coding for related proteins are also present in An. gambiae, but none have yet been confirmed by proteomics, although their expression profiles resemble those of authentic low-complexity cuticular proteins (Cornman and Willis, 2009). Members of the family have a high density of PV and PY pairs, but additional features described by Cornman and Willis appear to be restricted to mosquitoes, where both Aedes and Culex have been found to have larger families (Cornman and Willis, 2009; Willis, 2010).
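One simple way to put a number on "a high density of PV and PY pairs" is sketched below. The definition used (fraction of adjacent residue pairs that are PV or PY) is an assumption made for illustration, not necessarily the measure used by Cornman and Willis, and no threshold for family membership is implied.

```python
def pv_py_density(seq: str) -> float:
    """Fraction of adjacent residue pairs that are PV or PY."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    total_pairs = len(seq) - 1
    hits = sum(1 for i in range(total_pairs) if seq[i:i + 2] in ("PV", "PY"))
    return hits / total_pairs


# Made-up toy sequence: pairs are PV, VP, PY, YA, AA, so 2 of 5 qualify.
print(pv_py_density("PVPYAA"))
```

Comparing this value across all predicted proteins of a genome would show whether candidate CPLCP sequences stand out from the background.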

CPG, the glycine-rich protein family A group of 28 genes enriched in GGGG or GGxGG repeats was described in B. mori (Futahashi et al., 2008), but the group appears heterogeneous because six proteins with only zero to three repeats appear to belong to the CPLCP family; these were identified after that paper was published (Willis, 2010). Another subset of 18 appears to be lepidopteran-specific, and these can appropriately be designated as CPGs (see Willis, 2010, Supplementary Material 2, for details).
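The repeat count that separates the zero-to-three-repeat proteins from the rest can be approximated as follows. This is a minimal sketch that counts non-overlapping GGGG or GGxGG occurrences from left to right; how Futahashi et al. (2008) actually counted repeats is not stated here, so the counting rule and the function name are assumptions.

```python
import re

# GGGG or GGxGG, where x is any single residue.
GLY_REPEAT = re.compile(r"GGGG|GG.GG")


def count_gly_repeats(seq: str) -> int:
    """Count non-overlapping GGGG / GGxGG occurrences, scanning left to right."""
    return len(GLY_REPEAT.findall(seq.upper()))


# Made-up toy sequence containing one GGGG and one GGSGG.
print(count_gly_repeats("AGGGGAGGSGGA"))
```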

Apidermin family Three apidermins, small (6.1-9.2 kDa), highly hydrophobic, and with at least 30% alanine content, were described in Apis mellifera (Kucharski et al., 2007), and now three have been found in Nasonia, but as presently annotated, they are much larger (23-39 kDa). Members of the family do not have an obvious structure; rather, they were recognized by chromosomal linkage, and their role in the cuticle was confirmed with RT-PCR on cuticle-forming tissue. At present they have only been identified in Hymenoptera (Table 4). Their designation as a family thus is based on the initial publication, not the normal criterion of shared sequence similarity, and so it is not possible to evaluate the significance of numerous EST sequences from the beetle Diaprepes abbreviatus that are somewhat similar to A. mellifera apidermin 1 (e.g., CN474619.1).

Figure 2 WebLogos (see Figure 1) for three cuticular protein families. (A) TWDL family. Twenty-four sequences from eight species in six orders of insects were used. The continuous sequence was split to facilitate recognition of the four conserved regions. (B) CPLCG family. Note the highly conserved GHPG at residues 5, 8, 11, 14. Eighty-six sequences from dipterans were used. (C) CPLCW family. The 26 CPLCW sequences of this mosquito-restricted family were used. Unlike other WebLogos, the alignment for this one required gaps of five or eight amino acids between positions 16 and 25 to accommodate the longer Ae. aegypti sequences.

Figure 3 WebLogos (see Figure 1) for two cuticular protein families and one motif. (A) CPLCA family. The WebLogo is based on three sequences from each of four species, An. gambiae, Ae. aegypti, C. pipiens, and D. melanogaster, that had the closest match to AgamCPLCA1. This region corresponds to the retinin domain. (B) WebLogo for CPCFC family. Data from the single occurrence of this protein in individual genera of eight insect orders, plus the two occurrences in T. castaneum and Heliconius melpomene. All three (two in Coleoptera and Lepidoptera) repeat regions from each protein were used. (C) The 18 amino acid repeat from 40 sequences from 26 proteins from 5 insect orders and 2 crustaceans.

CPAP1 and CPAP3 families A recent publication has identified two more families of cuticular proteins: CPAP1 and CPAP3 (Jasrapuria et al., 2010). They are unusual in that they have multiple cysteine residues, an amino acid rarely found in cuticular proteins from the other families. The families were so named because they resemble some peritrophins, hence are peritrophin-like, but the spacing of the cysteines is distinct. The names come from Cuticular Proteins Analogous to Peritrophins. Comparable groups of six cysteines have been demonstrated to form a chitin-binding domain called the "peritrophin A domain," or "type 2 chitin-binding domain" (ChtBD2), with the six cysteines assumed to form three disulfide bridges. An exhaustive search for proteins with this domain was carried out in Tribolium, accompanied by RT-PCR analysis of their temporal and spatial distributions. It yielded, in addition to members of the two new families of cuticular proteins, several genuine peritrophins, as well as chitinases and chitin deacetylases. It is assumed that the ChtBD2 domains in all these proteins bind chitin, but this has only been demonstrated experimentally for a chitinase (Arakane et al., 2003) and a CPAP3 protein from another species (Nisole et al., 2010; see also section 5.5.4). So far, the CPAP1 family, with only one ChtBD2 domain, has only been identified in beetles, but the CPAP3 family, with three ChtBD2 domains, is more widespread (Table 4). Indeed, its motifs are found outside the arthropods (Jasrapuria et al., 2010). The founding members of the CPAP3 family were a group of proteins, named obstructors, in D. melanogaster (Barry et al., 1999; Behr and Hoch, 2005), among them a protein, Gasp, found in tracheae.

CPCFC family There is a third cuticular protein family with well-conserved cysteine residues. The founding member is BcNCP1, first identified in Blaberus craniifer (Jensen et al., 1997). It has three repeat regions, each with a pair of cysteines separated by five other amino acids; the first and fourth amino acids in each repeat are proline. In a recent publication (Willis, 2010) family status was not recognized, because at that time there were only single occurrences of BcNCP1 orthologs in any species, and a family must have paralogs within a species. Now that criterion has been met in Heliconius melpomene and Tribolium, and likely in another beetle, Diaprepes abbreviatus, each with two related genes. We therefore recognize family status for these paralogs. Several other species of beetles and moths have good orthologs, and in every case the middle cys-bearing region is missing. We are naming this family CPCFC in recognition of the two or three pairs of cysteines that are separated by five amino acids. A WebLogo is shown in Figure 3.
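The defining feature described above, a pair of cysteines separated by five residues (C-x(5)-C), can be located with a short pattern search. The sketch below reports only the position of each pair; it does not check the proline positions, because the repeat boundaries are not defined here, and the function name and toy sequence are hypothetical.

```python
import re

# C-x(5)-C: two cysteines separated by exactly five residues.
# The lookahead wrapper allows overlapping pairs to be reported.
CYS_PAIR = re.compile(r"(?=(C.{5}C))")


def find_cys_pairs(seq: str) -> list[int]:
    """Return the 1-based position of the first cysteine of each C-x(5)-C pair."""
    return [m.start() + 1 for m in CYS_PAIR.finditer(seq.upper())]


# Made-up toy sequence with two C-x(5)-C pairs, not a real CPCFC protein.
toy = "AACAAAAACGGGCPPPPPCA"
pairs = find_cys_pairs(toy)
print(pairs, len(pairs))
```

Under the description in the text, a candidate with two or three such pairs would merit closer comparison with BcNCP1 and its orthologs; the count alone is not sufficient for classification.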

Cuticular proteins not assigned to families There remain some cuticular proteins that have not reached the criterion for belonging to families. Among them are three proteins identified with proteomics in An. gambiae (described in Cornman and Willis, 2009), and a group called CPH (cuticular protein hypothetical) in Bombyx mori (Futahashi et al., 2008). Some of the CPH can now be assigned to families; others remain unclassified.
