Insect Transposable Elements Part 2

Methods to Uncover and Characterize Insect TEs

Early Discoveries and General Criteria

Before large amounts of genomic sequence data became available, TEs were often discovered through serendipitous observations during genetic experiments. As described in the introduction, McClintock’s observation of unstable mutations in maize led to her discovery of two mobile genetic elements, Ac and Ds, although the molecular characterization of these elements came many years later (reviewed in Fedoroff, 1989). Similarly, the observation of an unstable white-peach eye-color mutation in Drosophila mauritiana led to the discovery of the mariner transposon (Hartl, 1989). The piggyBac transposon was discovered as an insertion in a baculovirus after passage through a cell line of the cabbage looper Trichoplusia ni (reviewed by Fraser, 2000). In a slightly different vein, the D. melanogaster P and I elements were discovered because of their association with a genetic phenomenon called hybrid dysgenesis, which refers to a group of abnormal traits, including high mutation rates and sterility, in crosses of certain strains (Kidwell, 1977; Finnegan, 1989). The genetic mutations described above, albeit rare, tend to identify transposition events that resulted from active TEs in the genome.

The repetitive nature of TEs can also be used for their discovery and isolation, although not all repetitive elements are TEs. When DNA sequences are available, TEs can be identified on the basis of either similarity to known TEs, or common structural characteristics. In some cases, evidence for past TE insertion events could be identified on the basis of sequence analysis, which further supports the mobile history of a particular element. The criteria and methods described in this section are not unique to insects. However, it may be necessary to visit this topic here because of the lack of a systematic review on these issues, and because of the growing interest in TE analysis in the current genomic environment.

Experimental Methods to Isolate and Characterize Repetitive Elements

Several experimental approaches have been used to discover TEs on the basis of their repetitive nature. Although relatively straightforward, these methods may not clearly distinguish between TEs and other repetitive sequences in the genome. In other words, the repeats discovered using these methods are not always TEs. One way to discover repeats in the genome is to isolate visible bands in an agarose gel running a sample of restriction enzyme digested genomic DNA. This is based on the assumption that only highly reiterated sequences containing two or more conserved recognition sites for the restriction enzyme will produce a visible band amongst the smear of digested genomic DNA. The bands can be cut out from the gel and purified for cloning and sequencing. Another approach to search for repeats is to screen a genomic library using labeled genomic DNA as probe. This approach can be effectively used to identify abundant or highly repetitive sequences in the genome. It is based on the principle that only the repetitive fraction of the genome will produce a sufficient amount of labeled fragments to generate hybridization signals during the screening (Gale, 1987; Cockburn and Mitchell, 1989). A third method is to use Cot analysis, which is based on DNA reassociation kinetics, to help identify repetitive sequences in the genome (Adams et al., 1986; Peterson et al., 2002). For example, Cot analysis of genomic DNA can be performed to isolate a moderately repetitive portion of the genome that tends to contain TEs, and a subgenomic library can be constructed using this fraction of genomic DNA to search for potential TEs. Several methods can be used to identify and isolate TEs on the basis of information derived from related TEs. For example, homologous TE probes may be used in Southern blotting and genomic library screening experiments to identify related TEs. PCR using primers that are conserved between related TEs can also be used to isolate different members of a TE family.

Computational Approaches to Discover and Analyze TEs

The completion of a number of insect genome projects and the ongoing genome revolution fueled by the rapidly improving "next-generation" sequencing technologies provide an ever-expanding sea of data that can be explored to identify interspersed TEs. As described below, a number of new tools have been developed which represent a shift from merely masking TEs (RepeatMasker; reviewed by Jurka, 2000) to the discovery, annotation, and genomic analysis of TEs. The use of bioinformatics tools provides great advantages by allowing analysis of TEs in the entire genome, and by allowing quick surveys of a large number of TE families to identify the most promising candidates for discovering active TEs (see section 3.5) and for population analysis (section 3.9). It should be noted that these approaches are not limited to fully sequenced genomes. Because of the repetitive nature of TEs, sequences from a small fraction of a genome tend to contain a large number of TEs that may be discovered using bioinformatic approaches described here. Of course, a greater number of sequences and longer assembly would be beneficial in analyzing low-copy number or long TE sequences.

Homology-dependent approaches Searching for TEs in a genome on the basis of similarities to known elements discovered in different species is relatively straightforward. However, given their diversity and abundance, systematic computational approaches are necessary for efficient and comprehensive analysis. One such program (Berezikov et al., 2000) uses profile hidden Markov models to find all sequences matching the full-length reverse transcriptase with the conserved FYXDD motif common to all reverse transcriptases. We previously developed a BLAST-based systematic approach to simultaneously identify and classify TEs (Biedler and Tu, 2003). This approach incorporates multi-query BLAST (Altschul et al., 1997) and a few computer program modules (available at tefam.biochem.vt.edu) that organize BLAST output, retrieve sequence fragments, and mask the database for identified TEs. The method was successfully used to discover and characterize non-LTR retrotransposons in the An. gambiae genome assembly. More recently, other programs that automate the discovery and annotation of TEs within a genome assembly have also been reported (e.g., Rho et al., 2007; Rho and Tang, 2009).

Homology-independent methods There are a few computer programs that uncover certain groups of TEs based on their structural characteristics, rather than specific sequence homologies. For example, FINDMITE1 searches the database for inverted repeats flanked by user-defined direct repeats within a specified distance (Tu, 2001a). There is also a program, named MAK, that uncovers MITEs as well as reporting the associations of MITEs with neighboring genes and related autonomous DNA transposons (Yang and Hall, 2003). LTR_STRUC is a program that identifies LTR retrotransposons on the basis of the presence of long terminal repeats (most LTRs contain TG...CA termini), target site duplications, and additional information such as primer binding site and polypurine tract (McCarthy and McDonald, 2003). Although it is not designed to uncover solo LTRs or truncated non-LTR retrotransposons, the program offers a rapid and efficient approach to systematically identify and characterize LTR retrotransposons in a given genome. It can be used as a discovery tool for new families of LTR retrotransposons. Recent programs such as LTR_FINDER, LTRharvest, and LTRdigest offer further improvements with regard to efficiency and sensitivity in the discovery and annotation of LTR retrotransposons (Xu and Wang, 2007; Ellinghaus et al., 2008; Steinbiss et al., 2009).
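The structure-based search described above (inverted repeats flanked by direct repeats within a set distance) can be illustrated with a minimal sketch. This is not the FINDMITE1 implementation; the function names, the perfect-match requirement for the terminal inverted repeats, and the default parameter values are all assumptions chosen for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_mite_candidates(genome, tir_len=11, tsd="TA", min_len=50, max_len=700):
    """Report (start, end) of regions bounded by perfect inverted repeats
    of length tir_len and flanked on both sides by the direct repeat tsd.

    Coordinates are 0-based, end-exclusive, and exclude the flanking repeats.
    All defaults are illustrative assumptions, not published settings.
    """
    genome = genome.upper()
    n, t = len(genome), len(tsd)
    hits = []
    for pos in range(n):
        # A candidate must begin right after a copy of the direct repeat.
        if genome[pos:pos + t] != tsd:
            continue
        start = pos + t
        left_tir = genome[start:start + tir_len]
        if len(left_tir) < tir_len or "N" in left_tir:
            continue
        right_tir = reverse_complement(left_tir)
        # Scan allowed element lengths for the matching right terminus.
        first_end = start + max(min_len, 2 * tir_len)
        last_end = min(start + max_len, n - t)
        for end in range(first_end, last_end + 1):
            if genome[end - tir_len:end] == right_tir and genome[end:end + t] == tsd:
                hits.append((start, end))
    return hits
```

Real MITE termini are often imperfect, so a practical tool would tolerate mismatches in the inverted repeats and would filter the many chance hits that a short direct repeat such as TA produces.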

There are a small number of programs that identify TE sequences on the basis of their repetitive nature in the genome. The most commonly used programs include RECON, ReAS, RepeatGluer, RepeatScout, RepeatFinder, and PILER (Volfovsky et al., 2001; Bao and Eddy, 2002; Pevzner et al., 2004; Edgar and Myers, 2005; Li et al., 2005; Price et al., 2005). For example, RepeatFinder uses a clustering method to analyze repetitive sequences (Volfovsky et al., 2001), and RECON (Bao and Eddy, 2002) uses a multiple sequence alignment algorithm to identify all repetitive sequences. RepeatScout is a user-friendly and rapid repeat finding program that uncovers repeats by extending consensus seed sequences (Price et al., 2005). ReAS can be used for TE discovery from whole-genome shotgun sequences. Saha and colleagues compared all six de novo repeat discovery programs, and found RepeatScout to be most efficient for analysis of genome assemblies and ReAS to be most efficient for analysis of shotgun sequences (Saha et al., 2008). In addition to the repeat finding programs discussed above, there is also a novel approach that compares whole-genome alignments of closely related Drosophila species to identify TE insertions as revealed by disruptions of conservation (Caspi and Pachter, 2006). This approach can potentially identify the boundaries of TE insertions and allow the inference of the age of insertion. It may become widely used as more genome assemblies of closely related species are made available. It is often a daunting task to classify or annotate a large number of TEs uncovered using the ab initio or de novo approaches mentioned above. Programs such as TEpipe (Biedler and Tu, 2003), REPCLASS (Feschotte et al., 2009), and MGESCAN-non-LTR (Rho et al., 2009) are designed to automate the TE classification process.
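The idea that repeats can be found purely from their copy number can be shown with a very small k-mer counting sketch. It is a deliberately simplified stand-in and does not reproduce the algorithm of RepeatScout, RECON, or any other program named above; the function name and the default values of k and min_count are assumptions, and only the forward strand is counted.

```python
from collections import Counter


def frequent_kmers(sequence, k=12, min_count=3):
    """Return {kmer: count} for k-mers seen at least min_count times.

    Overlapping windows on the forward strand are counted; k-mers that
    contain N are ignored. High-count k-mers can serve as seeds that a
    real repeat finder would then extend and align.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {
        kmer: count
        for kmer, count in counts.items()
        if count >= min_count and "N" not in kmer
    }
```

As the text cautions for the experimental methods, a frequent k-mer only indicates repetitiveness; satellites, gene families, and segmental duplications will also be reported, so classification is a separate step.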

Diversity and Characteristics of Insect TEs

Overview

Virtually all classes and types of eukaryotic TEs have been found in insects. Insect TEs such as copia, gypsy, I, R1, P, mariner, hobo, piggyBac, and transib are the founding members of several diverse families/superfamilies that were later shown to have broad distributions in eukaryotes. In addition, recent studies have revealed a few novel and intriguing TEs in insects, which are described in detail below. The previous review (Tu, 2005) provided a relatively extensive compilation of the two classes of TEs in insects. It is not possible to discuss here all the new insect TEs discovered since then; instead, I will focus on recent advances and interesting features of some novel insect TEs, and describe new insights obtained from comparative genomic analysis.

Recent Advances

Class I TEs The recent discovery that some LTR retrotransposons use tyrosine recombinase instead of integrase further highlighted the flexibility in domain acquisition by LTR elements (Eickbush and Jamburuthugoda, 2008). The acquisition of the env-like protein by some LTR retrotransposons such as gypsy confers the ability to leave the cell and become infectious retroviruses (Eickbush and Malik, 2002). A few recent surveys revealed great diversity of LTR retrotransposons in sequenced insect genomes (Tubio et al., 2004, 2005; Xu et al., 2005; Nene et al., 2007; Tribolium Genome Sequencing Consortium, 2008; Minervini et al., 2009; Arensburger et al., 2010). Of the 17 clades of non-LTR retrotransposons, 12 have been found in insects (Eickbush and Malik, 2002; Biedler and Tu, 2003). In fact, the founding members of many of these clades were discovered in insects. Two new clades, named Loner and Outcast, were discovered in An. gambiae (Biedler and Tu, 2003). Recent surveys of the genomes of the flour beetle, the silkworm, and the Culicinae mosquito also revealed highly diverse non-LTR retrotransposons (Nene et al., 2007; Osanai-Futahashi et al., 2008; Tribolium Genome Sequencing Consortium, 2008; Arensburger et al., 2010). Insect SINEs characterized so far all belong to the tRNA-related group. For example, a SINE discovered in Ae. aegypti, named Feilai, consists of a tRNA-related promoter region, a tRNA-unrelated conserved region, and a triplet tandem repeat at its 3′ end (Figure 2). The Twin SINE family, which was discovered in Culex pipiens (Feschotte et al., 2001), consists of two tRNA-related regions separated by a 39-bp spacer. SINE200 from An. gambiae contains only one of the two conserved boxes found in tRNA-related Pol III promoters (Santolamazza et al., 2008). A recently discovered SINE in the silkworm is frequently found in the untranslated regions (UTRs) of genes (Xu et al., 2010).

Class II TEs Several cut-and-paste DNA transposons from insects are the founding members of their respective families/superfamilies that have broad distributions. The families/superfamilies that have been found in insects include IS630-Tc1-mariner, hAT, Merlin, piggyBac, PIF/Harbinger, P, and Transib (Shao and Tu, 2001; Robertson, 2002; Kapitonov and Jurka, 2003a; Feschotte, 2004). Conserved transposase sequences and TSDs of specific sequence or length are the hallmarks of each family/superfamily. The structural characteristics of representative elements from each family are shown in Figure 3. Recent surveys showed broad distribution of piggyBac transposons in insects (Handler et al., 2008; Wang et al., 2008, 2010). Recent expansion and reclassification of the IS630-Tc1-mariner superfamily will be discussed below as an example of the diversity of transposon superfamilies (Shao and Tu, 2001; Coy and Tu, 2005). Structural analysis of the Hermes (hAT) transposase and the DNA-transposase complex of Mos1 (mariner) further illustrated the molecular mechanisms of the cut-and-paste process (Hickman et al., 2005; Richardson et al., 2009). Episomal hAT elements, recently recovered in insects, may be maternally transmitted and influence transposition (O’Brochta et al., 2009). MITEs that share similar TSDs and TIRs with cut-and-paste DNA transposons in the IS630-Tc1-mariner, hAT, piggyBac, and PIF/Harbinger families have been found in mosquitoes (Tu, 1997, 2001a; Holt et al., 2002; Nene et al., 2007; Arensburger et al., 2010). A large number of MITEs have been discovered in silkworm (Coates et al., 2010a; Han et al., 2010). A MITE that generates TA-specific TSDs has also been reported in a Coleopteran insect (Braquart et al., 1999). Two hAT-like MITEs have been found in D. willistoni (Holyoake and Kidwell, 2003), and a deletion-derivative of the pogo transposon has been found in D. melanogaster (Feschotte et al., 2002). As in mosquitoes (Tu, 1997), MITEs were found to be associated with genes in Helicoverpa zea (Chen and Li, 2007). Advances in the complex DNA transposons Helitrons and Polintons/Mavericks are discussed in section 3.4.3.

Intriguing Insect TEs

Two intriguing families of Class I TEs: Maque and Penelope A family of very short interspersed repetitive elements named Maque has recently been found in An. gambiae. There are approximately 220 copies of Maque. Only approximately 60 bp long, Maque has the appearance of a distinct transposition unit. The majority of Maque elements are flanked by 9- to 14-bp TSDs. Maque has several characteristics of non-LTR retrotransposons, such as TSDs of variable length, imprecise 5′ terminus, and CAA simple repeats at the 3′ end. The evolutionary origin of Maque and the differences between Maque and other known retro-elements including SINEs are not yet known. We suggest that the 5′ end of Maque represents a strong stop position that causes frequent premature termination of reverse transcription (Tu, 2001b). Although no autonomous non-LTR retrotransposons have been found that share similar 3′ sequences with Maque, there is a family of non-LTR retrotransposons, Ag-I-2 (Biedler and Tu, 2003), that have the same CAA tandem repeats at their 3′ termini. It is possible that short sequences such as Maque, which contain just the reverse transcriptase recognition signal, could contribute to the genesis of some primordial SINEs (Tu, 2001b). Insertion polymorphism of this element and the SINE200 was used to study the incipient speciation between the M and S molecular forms of An. gambiae (Barnes et al., 2005; Santolamazza et al., 2008).

Penelope, another intriguing family, was discovered as a TE involved in the hybrid dysgenesis of crosses between field-collected and laboratory strains of D. virilis (Evgen’ev et al., 1997). It has a reverse transcriptase that is grouped with the reverse transcriptase from telomerase (Arkhipova et al., 2003). More strikingly, Penelope and related elements in bdelloid rotifers are able to retain their introns, which is inconsistent with a transposition mechanism involving an RNA intermediate. It was proposed that the Uri endonuclease domain found in all Penelope-like elements may allow them, at least in part, to use a DNA-mediated mechanism similar to that used by group I introns (Arkhipova et al., 2003). On the basis of these unique features and phylogenetic analysis of Penelope-like elements in diverse eukaryotes, Penelope was classified as a unique group that is distinct from LTR and non-LTR retrotransposons (Evgen’ev and Arkhipova, 2005).

Classification of the IS630-Tc1-mariner (ITm) superfamily It was previously shown that some prokaryotic IS elements, eukaryotic Tc1 and mariner transposons, and eukaryotic retrotransposons and retroviruses form a megafamily whose members share similar signature sequences or motifs in the catalytic domain of their respective transposase and integrase (Capy et al., 1996, 1997). The common motif for this transposase-integrase megafamily is a conserved D(Asp)DE(Glu) or DDD catalytic triad. The distance between the first two Ds is variable while the distance between the last two residues in the catalytic triad is mostly invariable for a given transposon family in eukaryotes, indicating functional importance. Within this megafamily, the eukaryotic DNA transposon families Tc1 and mariner and the bacterial IS630 element and its relatives in prokaryotes and ciliates comprise a superfamily, the IS630-Tc1-mariner superfamily, which is based on overall transposase similarities and a common TA dinucleotide insertion target (Henikoff, 1992; Doak et al., 1994; Robertson and Lampe, 1995; Capy et al., 1996; Shao and Tu, 2001). Tc1-like elements identified in fungi, invertebrates, and vertebrates all contain a DD34E motif, while most mariner elements identified in flatworm, insects, and vertebrates contain a DD34D motif. A few TEs that contain DD37D and DD39D motifs were previously regarded as basal subfamilies of the mariner family, namely the mori subfamily and the max subfamily, respectively (Robertson, 2002). We have reported a novel transposon named ITmD37E in a wide range of mosquito species (Shao and Tu, 2001). The ITmD37E transposases contain a conserved DD37E catalytic motif. Sequence comparisons and phylogenetic analyses suggest that ITmD37E is a new family, and that the mori subfamily (DD37D) and max subfamily (DD39D) of mariner may also be classified as two distinct families, namely the ITmD37D and ITmD39D families (Figure 4). The recognition of the three new families, ITmD37E, ITmD37D, and ITmD39D, is consistent with the fact that they share family-specific catalytic motifs and similar TIRs. Claudianos and colleagues also noticed the need for reclassification of the DD37D transposons, and named them the maT family (Claudianos et al., 2002). A group of transposons that contain a DD41D catalytic motif have been found in the fruit fly Ceratitis rosa, establishing yet another family (Gomulski et al., 2001; Robertson and Walden, 2003); namely, the ITmD41D family.
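The motif-based naming used above can be made concrete with a short sketch that derives a label such as DD37E from the positions of the three catalytic residues. It assumes that the number in the label counts the amino acids strictly between the second aspartate and the third residue, and that the residue positions are already known (for example from an alignment). The function name and the lookup table are illustrative assumptions, and a motif alone does not settle family membership.

```python
# Families as named in the text; DD34E is shared by Tc1 and Gambol, so the
# motif by itself is only a first hint and phylogeny is still required.
FAMILY_BY_MOTIF = {
    "DD34E": "Tc1 (or Gambol)",
    "DD34D": "mariner",
    "DD37E": "ITmD37E",
    "DD37D": "ITmD37D",
    "DD39D": "ITmD39D",
    "DD41D": "ITmD41D",
}


def catalytic_motif(protein, d1, d2, third):
    """Build a motif label from 0-based positions of the catalytic triad.

    d1 and d2 must point at aspartates (D); third must point at D or E.
    The spacing is the count of residues between d2 and third.
    """
    if not (0 <= d1 < d2 < third < len(protein)):
        raise ValueError("positions must be increasing and inside the protein")
    first, second, last = protein[d1], protein[d2], protein[third]
    if first != "D" or second != "D" or last not in ("D", "E"):
        raise ValueError("positions do not match a DDE or DDD triad")
    spacing = third - d2 - 1
    return f"DD{spacing}{last}"


def candidate_family(motif):
    """Look up the family suggested by a motif, or None if unlisted."""
    return FAMILY_BY_MOTIF.get(motif)
```

The distance between the first two aspartates is deliberately ignored here, since the text notes that it is variable within this megafamily.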

Figure 4 Structural features and classification of the IS630-Tc1-mariner superfamily. (A) Structural features. The catalytic triad in the transposase is highlighted. The characteristic TA target site duplications (TSDs) flanking an IS630-Tc1-mariner element are shown. The terminal inverted repeats (TIRs) specify the boundaries of the element. Possible introns are not shown. (B) Phylogenetic relationship between members of the IS630-Tc1-mariner superfamily on the basis of the catalytic domain. The alignment used here was previously described (Shao and Tu, 2001). The tree shown is an unrooted phylogram constructed using a minimum evolution algorithm. Two additional methods, neighbor-joining and maximum parsimony, were also used. Confidence of the groupings was estimated using 500 bootstrap replications. The bootstrap value represents the percent of times that branches were grouped together at a particular node. The first, second, and third numbers represent the bootstrap value derived from minimum evolution, neighbor-joining, and maximum parsimony analysis, respectively. Only the values for major groupings are shown. Various colors indicate different clades. All phylogenetic analyses were conducted using PAUP 4.0 b8 (Swofford, 2001).

In summary, according to recent analyses, the IS630-Tc1-mariner superfamily can be organized into seven families, ITmD37E, ITmD37D, ITmD39D, ITmD41D, Tc1, mariner, and pogo, and an unresolved clade which includes bacterial IS630-like elements and some fungal and ciliate transposons (Figure 4). Pogo is an interesting case, as it has a unique N-terminal DNA-binding domain and a long C-terminal domain rich in acidic residues, although it contains a DDxD catalytic domain related to IS630-Tc1-mariner transposons (Smit and Riggs, 1996). Recently discovered Gambol elements are distinct from the Tc1 elements, according to phylogenetic analysis, although both families contain the DD34E catalytic triad (Coy and Tu, 2005).

Microuli, a miniature subterminal inverted-repeat TE Microuli is a family of small (~200 bp) and highly AT rich (68.8-72.6%) TEs found in Ae. aegypti that do not have any coding capacity (Tu and Orphanidis, 2001). There is a 61- to 62-bp internal subterminal inverted-repeat as well as a 7-bp subterminal inverted-repeat 11 bp from the two termini. In addition, there are three imperfect subterminal direct repeats near the 5′ end. All of the above characteristics clearly resemble the structural features of MITEs. The only feature that separates Microuli from MITEs is that Microuli elements lack TIRs. Therefore, we use the phrase "miniature subterminal inverted-repeat transposable elements," or MSITEs, to refer to the structural characteristics of the Microuli elements. Short insertion sequences that contain subterminal inverted-repeats but lack TIRs have been identified in the genomes of rice and a Culex mosquito (Song et al., 1998; Feschotte and Mouches, 2000b). Of the 19 nucleotides at the 5′ (and only 5′) terminus of Microuli, 14 are identical to the TIR of Wuneng, a previously characterized MITE in Ae. aegypti (Tu, 1997). Both Microuli and Wuneng insert specifically into the TTAA target. It has been suggested that MITEs and the autonomous DNA transposons share the same transposition machinery based on common TIRs (Feschotte et al., 2002). How, then, did Microuli transpose without the TIRs? The three subterminal direct repeats could potentially be the binding sites for transposases, because subterminal inverted repeats and subterminal direct repeats have been shown to bind transposases in several autonomous DNA transposons (Morgan and Middleton, 1990; Beall and Rio, 1997; Becker and Kunze, 1997). It remains unclear how the termini of Microuli are determined at the strand cleavage step without the TIR. The TTAA target duplication plus a 3-bp TIR are essential for the excision of the autonomous transposon piggyBac (Bauser et al., 1999). Therefore, it is possible that Microuli may also be able to use the TTAA target sequence as part of the signal for recombination. It is tempting to hypothesize that some MITEs could evolve from MSITEs through mutation and/or recombination events at the termini which would result in TIRs. Similar elements with subterminal inverted repeats have been recently found in Helicoverpa zea (Coates et al., 2010a).

Complex DNA transposons: Helitrons and Polintons/Mavericks Helitrons use a rolling-circle mechanism for transposition, and they have been found in D. melanogaster, An. gambiae, and Lepidopteran species (Kapitonov and Jurka, 2001, 2003a; Coates et al., 2010b). Insect Helitrons have several characteristics, including short specific terminal sequences (5′ TC and 3′ CTAG), a 3′ hairpin, and the lack of TSDs. Instead of a cut-and-paste transposase, Helitron1 in An. gambiae encodes an intronless protein including domains similar to helicase and replication initiation protein. There are approximately 100 copies of Helitron elements in An. gambiae, which form 10 distinct families (Kapitonov and Jurka, 2003a).

Polintons/Mavericks are broadly distributed in metazoa, fungi, and various single-cell eukaryotes (Feschotte and Pritham, 2005; Kapitonov and Jurka, 2006; Pritham et al., 2007). These elements can be 20 kb in length and share several features, including 6-bp TSD, long TIRs, and coding sequences for integrase, DNA polymerase, and a few other proteins. They appear to be related to adenoviruses, bacteriophages, and eukaryotic linear plasmids. It is proposed that an excised Polinton/Maverick can self-replicate with its own polymerase and integrate into the genome using its integrase. This group represents the most complex DNA transposons known to date, and they are found in Drosophila and Tribolium (Pritham et al., 2007).

Insights from Comparative Genomic Analysis

There are 31 insect genome assemblies available at the National Center for Biotechnology Information (NCBI) Genome Project Database (ncbi.nlm.nih.gov/entrez). Nineteen genome assemblies are available from Dipteran species, including 1 Hessian fly (http://www.ncbi.nlm.nih.gov/genomeprj/45867), 6 mosquitoes (Holt et al., 2002; Nene et al., 2007; Arensburger et al., 2010; Lawniczak et al., 2010; http://www.ncbi.nlm.nih.gov/genomeprj/46227), and 12 Drosophila (Drosophila 12 Genomes Consortium, 2007). There are assemblies from seven Hymenopteran species, including the honeybee (Honeybee Genome Sequencing Consortium, 2006), three ants (Bonasio et al., 2010; http://www.ncbi.nlm.nih.gov/genomeprj/48091), and three wasps (Werren et al., 2010). There are also assemblies from one Lepidopteran species, Bombyx mori (Xia et al., 2004; Mita et al., 2004), and one Coleopteran species, Tribolium castaneum (Tribolium Genome Sequencing Consortium, 2008). Assemblies from three hemimetabolous insects are available, including one Phthirapteran insect, the body louse (Kirkness et al., 2010), and two Hemipteran insects (The International Aphid Genomics Consortium, 2010; http://www.ncbi.nlm.nih.gov/genomeprj/13648). In addition, a number of insect genomes are being sequenced using "next-generation" approaches, and it is anticipated that rapid expansion of sequenced genomes will bring tremendous opportunities to the investigation of TE diversity and evolution. Whole-genome comparative analysis of insect TEs is still in its early stages, and a few interesting observations are highlighted below. Systematic analysis of the 12 Drosophila genomes revealed that while the TE content varies from 2.7% to ~25% of the host genomes, the relative abundance of different groups of TEs is conserved across most of the species (Drosophila 12 Genomes Consortium, 2007). Comprehensive analysis identified over 100 potential horizontal transfer events by more than 20 TEs among the 12 Drosophila species, most of which involved DNA transposons and LTR retrotransposons (Loreto et al., 2008; Bartolome et al., 2009). Systematic comparison of multiple aligned genomes revealed TE insertion sites across the entire genomes, and supported a hypothesis that most TEs in D. melanogaster are recently active (Caspi and Pachter, 2006). The published genomes of Anopheles, Culex, and Aedes mosquitoes vary nearly five-fold in size, from ~270 Mbp for An. gambiae (Holt et al., 2002) to ~500 Mbp for C. quinquefasciatus (Arensburger et al., 2010) and ~1300 Mbp for Ae. aegypti (Nene et al., 2007). TE contents in these three species are 11-16%, 29%, and 47% of the assembled genomes, respectively, indicating that TEs contributed significantly to the genome size variations among mosquito species. While 16% of the Ae. aegypti genome is occupied by MITE-like elements, cut-and-paste DNA transposons represent only 3% of the genome, suggesting that a small number of DNA transposons may be responsible for cross-mobilizing a large number of non-autonomous MITE-like sequences (Nene et al., 2007). Systematic comparisons also revealed an apparent horizontal transfer event between Aedes and Anopheles mosquitoes involving an ITmD37E DNA transposon (Biedler and Tu, 2007). Among the sequenced Hymenopteran species, the honeybee genome contains only ~7% repetitive sequences while repeat contents range from 15 to 27% in the ants and wasps (Honeybee Genome Sequencing Consortium, 2006; Bonasio et al., 2010; Werren et al., 2010). The parasitic body louse harbors only a very small number of TEs, which occupy 1% of its 110-Mbp genome (Kirkness et al., 2010).
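The contribution of TEs to the mosquito genome size differences can be checked with simple arithmetic on the figures quoted above. The sketch below is a back-of-the-envelope calculation only: it applies the percentages, which refer to assembled genomes, to the approximate genome sizes, and it uses the upper bound (16%) of the An. gambiae range. The function name is an assumption.

```python
def te_and_non_te_mbp(genome_mbp, te_percent):
    """Split an approximate genome size into TE and non-TE portions (Mbp)."""
    te = genome_mbp * te_percent / 100.0
    return te, genome_mbp - te


# Approximate sizes (Mbp) and TE percentages as quoted in the text.
mosquito_genomes = {
    "An. gambiae": (270, 16),          # upper bound of the 11-16% range
    "C. quinquefasciatus": (500, 29),
    "Ae. aegypti": (1300, 47),
}

for species, (size_mbp, te_percent) in mosquito_genomes.items():
    te, non_te = te_and_non_te_mbp(size_mbp, te_percent)
    print(f"{species}: TE ~{te:.0f} Mbp, non-TE ~{non_te:.0f} Mbp")
```

Under these assumptions the TE fraction of Ae. aegypti alone (about 611 Mbp) exceeds the entire C. quinquefasciatus genome, while the non-TE fraction also grows with genome size, so TEs account for a large part, but not all, of the variation.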

Search for Active TEs in Insect Genomes

Active TEs may be used as tools for the genetic manipulation of insects for basic and applied research (see section 3.9). In addition, the behavior of TEs in host genomes and their spread in natural populations may be studied by monitoring active TE families. It is therefore highly desirable to isolate active copies of TEs. As described in section 3.3, TEs discovered from observations of genetic mutations tend to result from active transposition events. Although several active TEs were discovered in this manner, this discovery process relies heavily on fortuitous events. Several methods that may facilitate the search for active TEs in insect genomes are described below.

Identification of Potentially Active TEs on the Basis of Bioinformatic Analysis

As discussed above, the ongoing genome revolution has produced an immense quantity of sequence data from which diverse TEs can be identified in various insect genomes. The computational programs described in section 3.3 can greatly facilitate the discovery and characterization of a large number of TE families. Unfortunately, the vast majority of TEs have accumulated inactivating mutations during evolution, rendering the discovery of active TEs a task similar to finding needles in a haystack. Bioinformatic analysis can provide leads to potentially active candidates that can be studied further. For example, using a semi-automated reiterative search strategy, we identified many potentially active families of non-LTR retrotransposons in the An. gambiae genome (Biedler and Tu, 2003). Here, candidate families were identified based on sequence characteristics, which include the presence of full-length elements, intact open reading frames, multiple copies with high nucleotide identity, and the presence of TSDs. High nucleotide identity indicates recent amplification from a source element, without enough time for divergence caused by nucleotide substitution and other mutations. It should be emphasized that sequence analysis can only provide leads for further analysis. For example, high sequence identity between copies of a TE family may not always indicate recent transposition activity because it can also result from gene conversion events. Using the bioinformatics principles described above, a mosquito hAT element named Herves and an ant mariner element named Mboumar were identified and subsequently shown to support transposition (Arensburger et al., 2005; Munoz-Lopez et al., 2008).
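Two of the sequence criteria listed above, intact open reading frames and high nucleotide identity among copies, are easy to express in code. The following is a minimal sketch under stated assumptions: copies are supplied as pre-aligned strings of equal length with "-" for gaps, the ORF is supplied as an in-frame coding sequence, and the standard stop codons apply. The function names are hypothetical and no identity threshold is implied by the source.

```python
from itertools import combinations

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def percent_identity(a, b):
    """Percent identity over aligned columns where neither copy has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def mean_pairwise_identity(aligned_copies):
    """Average percent identity over all pairs of aligned copies."""
    pairs = list(combinations(aligned_copies, 2))
    if not pairs:
        raise ValueError("need at least two copies")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


def has_intact_orf(cds):
    """True if cds starts with ATG, ends with a stop, and has no internal stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )
```

As the text emphasizes, a family that passes such filters is only a lead: high identity can also arise from gene conversion, so transposition still has to be demonstrated experimentally.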

Detection of TE Transcription

Transcription is a required step during transposition of the RNA-mediated TEs. Although DNA-mediated TEs do not use RNA as an intermediate, transcription is required for production of transposase proteins. Therefore, the detection of transcription may offer further support for an active family in both classes of TEs. Transcription can be inferred if a match is found in an expressed sequence tag (EST) database to a TE sequence from the same organism. For example, 21 families of non-LTR retrotransposons had significant hits when BLAST searches were carried out against over 94,000 An. gambiae ESTs downloaded from NCBI (Biedler and Tu, 2003). Comparisons of TEs against high-throughput Illumina sequencing databases may also reveal TE transcription. Transcription of TEs may also be detected experimentally by RT-PCR and Northern blot. The source of mRNA may affect the outcome of these experiments, because the activity of some TEs may be temporally and spatially controlled. Recent analysis showed that transcription of the hobo transposon may be developmentally regulated in D. melanogaster (Depra et al., 2009). It has been shown that TE activity can be elevated during the culturing of mammalian and plant cells (Wessler, 1996; Grandbastien, 1998; Liu and Wendel, 2000; Kazazian and Goodier, 2002). Different cell lines are available for a number of insect species. One caveat of the above approach is that transcripts detected either experimentally or by EST analysis could arise from spurious transcription, for example transcription from a nearby host promoter.

Figure 5 TE display, a method to scan multiple insertion sites of a TE in the genome. (A) Principle of TE display, which is a modified form of Amplified Fragment Length Polymorphism (AFLP). The difference is that TE-specific primers (F1 and F2) are used in addition to the adapter primer (R1). F2 is labeled as shown by the asterisk. (B) Partial image of a TE display using primers for the Pegasus element with eight female individuals from an Anopheles gambiae colony (GAMCAM) originally collected from Cameroon (Biedler et al., 2003). The eight samples on the left are amplified with a Pegasus-specific primer, Peg-F2. The eight samples on the right are the same as those on the left except they were amplified with primer Peg-F3, which is designed to amplify a product smaller by three bases. The three-base shift is clearly observable. A size marker is shown on the right. Bands from a TE display gel were re-amplified and sequenced, showing that they contained Pegasus sequences as well as flanking genomic and adapter sequences in the expected order (not shown). Co-migrating bands among different individuals had the same flanking genomic sequence, indicating that they were from the same genomic locus. Note that Pegasus is a MITE.