Insect Genomics Part 2

Sequenced Genomes

Fruit fly, Drosophila melanogaster. The D. melanogaster sequencing project used several sequencing strategies, including sequencing of individual clones and sequencing of genomic libraries with three insert sizes (Adams et al., 2000). A portion of the D. melanogaster genome corresponding to approximately 120 megabases of euchromatin was assembled. This assembled genomic sequence contained 13,600 predicted genes. Some of the proteins encoded by these predicted genes showed high similarity with vertebrate homologs involved in processes such as replication, chromosome segregation, and iron metabolism. About 700 transcription factors have been identified based on their sequence similarity with those reported from other organisms. Half of these transcription factors are zinc-finger proteins, and 100 of them contain homeodomains. Genome sequencing identified 22 additional homeodomain-containing proteins and 4 additional nuclear receptors. Nuclear receptors are sequence-specific, ligand-dependent transcription factors that function as both transcriptional activators and repressors and regulate many physiological and metabolic processes. The D. melanogaster genome encodes 20 nuclear receptor proteins. General translation factors identified in other sequenced genomes are also present in the D. melanogaster genome. Interestingly, the D. melanogaster genome contains six genes encoding proteins highly similar to the messenger RNA (mRNA) cap-binding protein, eIF4E, suggesting that there may be an added level of complexity to the regulation of cap-dependent translation in the fruit fly. The cytochrome P450 monooxygenases (P450s) are a large superfamily of proteins that are involved in the synthesis or degradation of hormones and pheromones, as well as the metabolism of natural and synthetic toxins and insecticides. About 20% of the proteins encoded by the D. melanogaster genome are likely targeted to cellular membranes, since they contain four or more hydrophobic helices. The largest families of membrane proteins are sugar permeases, mitochondrial carrier proteins, and the ATP-binding cassette (ABC) transporters, encoded by 97, 38, and 48 genes, respectively.

Table 1 List of Sequenced Genomes

| Common name | Scientific name | Genome size (Mb) | Number of genes predicted | Reference |
|---|---|---|---|---|
| Beetle, Red flour | Tribolium castaneum | | | Richards et al., 2008 |
| Fruit fly | Drosophila ananassae | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila erecta | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila grimshawi | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila melanogaster | | | Adams et al., 2000 |
| Fruit fly | Drosophila mojavensis | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila persimilis | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila pseudoobscura | | | Richards et al., 2005 |
| Fruit fly | Drosophila sechellia | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila simulans | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila virilis | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila willistoni | | | Drosophila 12 Genome Consortium, 2007 |
| Fruit fly | Drosophila yakuba | | | Drosophila 12 Genome Consortium, 2007 |
| Honey bee | Apis mellifera | | | The Honey Bee Genome Consortium, 2006 |
| Louse, body | Pediculus humanus | | | Kirkness et al., 2010 |
| Malaria mosquito | Anopheles gambiae | | | Holt et al., 2002 |
| Yellow fever mosquito | Aedes aegypti | | | Nene et al., 2007 |
| Southern house mosquito | Culex quinquefasciatus | | | Arensburger et al., 2010 |
| Pea aphid | Acyrthosiphon pisum | | | The Pea Aphid Genome Consortium, 2010 |
| Wasp, parasitoid | Nasonia vitripennis, Nasonia giraulti, Nasonia longicornis | | | Werren et al., 2010 |
| Silk moth | Bombyx mori | | | The International Silkworm Genome Consortium, 2008 |

Among the proteins involved in biosynthetic networks, 31 triacylglycerol lipases that are involved in lipolysis and energy storage and redistribution and 32 uridine diphosphate (UDP) glycosyl transferases (which participate in the production of sterol glycosides and in the biodegradation of hydrophobic compounds) are encoded by the D. melanogaster genome. One additional ferritin gene and two additional transferrin genes have been identified by genome sequencing.

In 2005, Richards and colleagues published the genome of a second Drosophila species, Drosophila pseudoobscura (Richards et al., 2005). In 2007, the Drosophila Genome Consortium completed the sequencing of 10 additional Drosophila genomes: D. sechellia, D. simulans, D. yakuba, D. erecta, D. ananassae, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi (Drosophila 12 Genome Consortium, 2007). Comparative analysis of sequences from these 10 genomes and the 2 genomes published earlier (D. melanogaster and D. pseudoobscura) identified many changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. Many characteristics of the genomes, such as the overall size, the total number of genes, the distribution of transposable element classes, and the patterns of codon usage, are well conserved among these 12 genomes. Interestingly, a number of genes coding for proteins involved in environmental interactions and reproduction showed rapid change. In these 12 genomes, microRNA genes are more conserved than the protein-coding genes. Genome-wide alignments of the 12 Drosophila species resulted in the prediction and refinement of thousands of protein-coding exons, genes coding for RNAs such as miRNAs, transcriptional regulatory motifs, and functional regulatory regions (Stark et al., 2007). For more information on the comparative analysis of the 12 Drosophila genomes, the reader is directed to Ashburner’s excellent preface article (Ashburner, 2007).

Malaria mosquito, Anopheles gambiae. A total of 278 Mb of genome sequence from An. gambiae was obtained by the whole-genome shotgun (WGS) method (Holt et al., 2002). About 10-fold coverage of the genome sequence was achieved. The size of the assembled An. gambiae genome is larger than that of D. melanogaster (120 Mb). About 14,000 predicted genes were identified in the assembled genome sequence. When compared to the D. melanogaster genome, the An. gambiae genome contained 100 additional serine proteases, which are central effectors of innate immunity and other proteolytic processes. The presence of additional serine proteases in An. gambiae may be due to differences in feeding behavior, as well as its intimate interactions with both vertebrate hosts and parasites. Also, 36 additional proteins containing fibrinogen domains (carbohydrate-binding lectins that participate in the first line of defense against pathogens by activating the complement pathway in association with serine proteases) and 24 additional cadherin domain-containing proteins were found in An. gambiae. Most of the genes coding for transcription factors reported from sequenced genomes, including the C2H2 zinc-finger, POZ, Myb-like, basic helix-loop-helix, and homeodomain-containing proteins, are also present in the An. gambiae genome. An over-representation of the MYND domain was observed in the An. gambiae genome. This domain is predominantly found in chromatin proteins, which are believed to mediate transcriptional repression.

Genes coding for proteins involved in the visual system, structural components of the cell adhesion and contractile machinery, and energy-generating glycolytic enzymes that are required for active food seeking are present in higher numbers in the An. gambiae genome when compared with the D. melanogaster genome. Genes coding for salivary gland components, as well as anabolic and catabolic enzymes involved in protein and lipid metabolism, are over-represented in the An. gambiae genome. Genes coding for proteins involved in insecticide resistance, such as transporters and detoxification enzymes, were also found in higher numbers in the An. gambiae genome when compared to their numbers in the D. melanogaster genome.

Red flour beetle, Tribolium castaneum. The 160-Mb T. castaneum genome sequence was obtained by WGS, and contained 16,404 predicted genes (Richards et al., 2008). The T. castaneum genome showed expansions in odorant and gustatory receptors, as well as P450s and other detoxification enzyme families. In addition, the T. castaneum genome contained more ancestral genes involved in cell-cell communication when compared to other insect genomes sequenced to date. RNA interference is systemic in T. castaneum, and thus works very well. The SID-1 multi-transmembrane protein involved in double-stranded RNA (dsRNA) uptake in C. elegans was not found in D. melanogaster. However, three genes that encode proteins similar to SID-1 were found in the T. castaneum genome. Expansions of odorant receptors, CYP proteins, proteinases, diuretic hormones, a vasopressin hormone and receptor, and chemoreceptors suggest that these adaptations allowed T. castaneum to become a serious pest of stored grain.

Honey bee, Apis mellifera. The 236-Mb A. mellifera genome was assembled from 1.8 Gb of sequence obtained by WGS (The Honey Bee Genome Consortium, 2006). A total of 10,157 potential genes were identified in the assembled genome sequence. Genes coding for most of the highly conserved cell signaling pathways are present in the A. mellifera genome. Seventy-four genes coding for 96 homeobox domains were identified in the A. mellifera genome. When compared to the D. melanogaster genome, the A. mellifera genome contained more genes coding for odorant receptors and proteins involved in nectar and pollen utilization. This genome also showed fewer genes coding for proteins involved in innate immunity, detoxification enzymes, cuticle-forming proteins, and gustatory receptors.
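The relationship between the amount of sequence generated and the size of the assembled genome can be expressed as average fold coverage: total sequenced bases divided by genome size. The Python sketch below applies this to the honey bee figures quoted above. The function name is illustrative, and the result is only a rough estimate, since published coverage values depend on read filtering and on the genome-size estimate used.

```python
def fold_coverage(total_sequenced_bases: float, genome_size_bases: float) -> float:
    """Average sequencing depth: total bases read divided by genome size."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return total_sequenced_bases / genome_size_bases


# Figures quoted in the text for A. mellifera: 1.8 Gb of sequence, 236-Mb assembly.
honey_bee_coverage = fold_coverage(1.8e9, 236e6)
print(f"A. mellifera: roughly {honey_bee_coverage:.1f}-fold coverage")
```

The same function can be applied to any of the WGS projects described here, provided both the total sequence yield and the genome size are known.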

Parasitoid wasps, Nasonia vitripennis, N. giraulti, and N. longicornis. The 240-Mb N. vitripennis genome was assembled from sequences obtained by the Sanger sequencing method (Werren et al., 2010). Sequences from two sibling species, N. giraulti and N. longicornis, were completed with one-fold Sanger and 12-fold, 45 base-pair (bp) Illumina genome coverage. The assembled genome sequence contained 17,279 predicted genes. About 60% of Nasonia genes code for proteins showing high similarity with human proteins, 18% of the genes code for proteins showing similarity with other arthropod homologs, and about 2.4% of Nasonia genes code for proteins similar to those in A. mellifera, which could therefore be Hymenoptera-specific. About 12% of genes code for proteins that showed no similarity with known proteins, and therefore may be Nasonia-specific.

Body louse, Pediculus humanus humanus. The 108-Mb P. h. humanus genome was assembled from 1.3 million paired-end reads from plasmid libraries obtained by WGS (Kirkness et al., 2010). The body louse has the smallest genome of all the insect genomes sequenced so far. The assembled genome contained 10,773 protein-coding genes and 57 microRNAs. Compared with other insect genomes, the body louse genome contains significantly fewer genes associated with environmental sensing and response, including those encoding odorant and gustatory receptors and detoxifying enzymes. Only 104 non-sensory G protein-coupled receptors (GPCRs) and 3 opsins were identified in the P. h. humanus genome. This insect has the smallest repertoire of GPCRs identified in any sequenced insect genome to date. Only 10 odorant receptors were detected in the P. h. humanus genome, and only 37 genes encode P450s. Despite its smaller size, the P. h. humanus genome contains homologs of all 20 nuclear receptors identified in the D. melanogaster genome.

Pea aphid, Acyrthosiphon pisum. The 464-Mb genome of A. pisum was assembled from 4.4 million Sanger sequencing reads (The Pea Aphid Genome Consortium, 2010). Analysis of the A. pisum genome showed extensive gene duplication events. As a result, the aphid genome appears to have more genes than any of the previously sequenced insects. Genes coding for proteins involved in chromatin modification, miRNA synthesis, and sugar transport are over-represented in the A. pisum genome when compared with other insect genomes sequenced to date. About 20% of the predicted genes in the A. pisum genome code for proteins with no significant similarity to other known proteins. Proteins involved in amino acid and purine metabolism are encoded by both host and symbiont genomes at different enzymatic steps. Selenocysteine biosynthesis is not present in the pea aphid, and selenoproteins are absent. Several genes in the A. pisum genome were found to have arisen from bacterial ancestors, and some of these genes are highly expressed in bacteriocytes, where they may function in the regulation of symbiosis. Interestingly, the genes coding for proteins that function in the IMD pathway of the immune system are absent in the A. pisum genome.

Yellow fever mosquito, Aedes aegypti. The 1.38-Gb genome of Ae. aegypti was assembled from sequence reads obtained by WGS (Nene et al., 2007). This is the largest insect genome sequenced to date, and is about five times larger than the An. gambiae and D. melanogaster genomes. Approximately 47% of the Ae. aegypti genome consists of transposable elements, and this large number of transposable elements could have contributed to the larger size of the Ae. aegypti genome. A total of 15,419 predicted genes were identified in the assembled genome. Compared to the genome of An. gambiae, an increase in the number of genes encoding odorant-binding proteins, cytochrome P450s, and cuticle proteins was observed in the Ae. aegypti genome.
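To put the transposable element content in perspective, the figures above can be turned into absolute sizes. The following Python sketch is a back-of-the-envelope calculation using only numbers quoted in this section (1.38 Gb, 47%, and the 278-Mb An. gambiae sequence); the variable names are illustrative.

```python
# Values quoted in the text; all sizes in megabases (Mb).
ae_aegypti_genome_mb = 1380.0   # 1.38 Gb
te_fraction = 0.47              # approximate transposable element share
an_gambiae_genome_mb = 278.0

te_mb = ae_aegypti_genome_mb * te_fraction         # sequence made of transposable elements
non_te_mb = ae_aegypti_genome_mb - te_mb           # everything else
size_ratio = ae_aegypti_genome_mb / an_gambiae_genome_mb

print(f"Transposable elements: ~{te_mb:.0f} Mb")
print(f"Remaining sequence:    ~{non_te_mb:.0f} Mb")
print(f"Ae. aegypti / An. gambiae size ratio: ~{size_ratio:.1f}")
```

Under these figures, transposable elements alone account for roughly 650 Mb, which is more than twice the entire An. gambiae sequence.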

Silk moth, Bombyx mori. The silkworm genome was sequenced by Japanese and Chinese laboratories simultaneously. The Japanese group used sequence data derived from WGS to assemble 514 Mb including gaps, and 387 Mb without gaps (Mita et al., 2004). Chinese scientists assembled sequences obtained by WGS into a 429-Mb genome (Xia et al., 2004). The two data sets were later merged and assembled (The International Silkworm Genome Consortium, 2008). This resulted in 8.5-fold sequence coverage of an estimated 432-Mb genome. The repetitive sequence content of this genome was estimated at 43.6%. A total of 14,623 gene models were predicted using a GLEAN-based algorithm. Among the predicted genes, 3000 showed no homologs in insects or vertebrates. The presence of specific tRNA clusters and several sericin gene clusters correlates with the main function of this insect: the massive production of silk.

Recently, an international consortium of scientists sequenced the genomic DNA of 40 domesticated and wild silkworm strains to approximately threefold coverage. This represents 99.88% of the genome, and led to the development of a single base-pair resolution silkworm genetic variation map (Xia et al., 2009). This effort identified ~16 million single-nucleotide polymorphisms, many indels, and structural variations. These studies showed that domesticated silkworms are genetically different from wild ones; nonetheless, they have maintained high levels of genetic variability. These findings suggest a short domestication event involving a large number of individuals. A total of 354 candidate genes that are expressed in the silk gland, midgut, and testes may have played an important role during domestication.

Southern house mosquito, Culex quinquefasciatus. C. quinquefasciatus is a vector of important viruses such as the West Nile virus and the St Louis encephalitis virus, and harbors nematodes that cause lymphatic filariasis. Arensburger and colleagues sequenced and assembled the whole genome of C. quinquefasciatus (Arensburger et al., 2010). A total of 18,883 genes, more than the numbers reported from the other two mosquito genomes (Aedes aegypti and Anopheles gambiae), were identified in the assembled C. quinquefasciatus genome. An increase was observed in the number of genes coding for olfactory and gustatory receptors, immune proteins, and enzymes involved in xenobiotic detoxification, such as cytosolic glutathione transferases and cytochrome P450s.

Genome Analysis

Since its introduction, Sanger sequencing has been widely applied in most genome sequencing projects (Sanger et al., 1977); as a result, a large volume of sequence information from a variety of species has been deposited into various databases. With full genome sequences deciphered for a number of species, scientists can now begin to address biological questions at a genome-wide level. These analyses include the measurement of global gene expression, the identification of functional elements, and the mapping of genome regions associated with quantitative traits. Various new technologies have also been developed to assist with genome analysis. These include DNA microarrays (Schena et al., 1995), serial analysis of gene expression (SAGE) (Schena et al., 1995), chromatin immunoprecipitation microarrays (Ren et al., 2000; Iyer et al., 2001; Lieb et al., 2001), next generation sequencing (NGS) (Margulies et al., 2005; Shendure et al., 2005), genome-wide RNAi screens (Kiger et al., 2003), comparative genomics (Kiger et al., 2003), and metagenomics (Chen and Pachter, 2005). These genomic analysis tools have greatly improved our understanding of how biological and cellular functions are regulated by the RNAs or proteins encoded in an organism’s genome. Especially in agricultural research, functional genomics studies will enhance our understanding of the biology of insect pests and disease vectors, which in turn will assist the design of future pest control strategies. Here, we discuss technologies used for functional genomics studies, with an emphasis on forward genetics, DNA microarray, and NGS technologies, and their applications in research on insects.

Forward and Reverse Genetics

The function of genes is often studied using forward genetics approaches. In forward genetic screens, insects are treated with mutagens to induce DNA lesions, followed by a screen to identify mutants with a phenotype of interest. The mutated gene is then identified by employing standard genetic and molecular methods.

Follow-up studies on the mutant phenotype, including molecular analyses of the gene, often lead to determination of its function. Forward genetics approaches have been used for determining the function of many genes. In the fruit fly, D. melanogaster, genetic screens have been used for a number of years to discover gene-phenotype associations. With the availability of massive amounts of data derived from whole-genome and omics studies, a systems biology approach needs to be applied to enhance the power of gene function discovery in vivo. Mobile elements or chemicals are often used as mutagenesis tools (Ryder and Russell, 2003). The P element has been widely used in D. melanogaster forward genetics since its development as a tool for transgenesis in 1982 (Rubin and Spradling, 1982). The insertion of P elements into the D. melanogaster genome allowed subsequent cloning and characterization of a large number of fly genes. P-element mediated transgenesis is often used to create mutants by excising the flanking genes through imprecise mobilization of the P elements. P elements were also modified to study genes based not only on a phenotype, but also on RNA or protein expression patterns; these approaches are often referred to as enhancer trap and gene trap technologies. P elements are also being used as mutagenesis agents in a project aimed at generating insertions in every predicted gene in the fruit fly genome.

Recent developments in transgenic techniques have focused on the site-specific integration of transgenes using recombinases and integrases, which has made forward genetics in D. melanogaster more effective and specific. One of the major drawbacks of P-element mediated transgenesis is the non-specific and positional effects caused by inserting exogenous DNA into the insect genome. Recently, several methods have been developed to eliminate these unwanted, non-specific effects in transgenic insects. Transgene co-placement was developed by Siegal and Hartl (1996). This method uses two transgenes, a rescue fragment and its mutant version, which are inserted into the same locus by using a P-element vector that contains the recognition sites FRT (the FLP recombinase recognition site) and loxP (the Cre recombinase recognition site). After integration, FLP can remove one transgene, such as the rescue gene, and Cre can remove the other transgene, which may be the mutant version. Golic and colleagues (Golic et al., 1997) developed a method that uses FLP recombinase to remobilize a transgene. It relies on a donor transposon that contains the transgenic insert together with a marker gene, such as white, flanked by two FRT sites, and an acceptor transposon that contains a second marker and one FRT site. The remobilization of the donor transposon by FLP can be followed by changes in the expression of the white gene. The remobilization results in the excision of the transgene and its potential integration into the FRT site of the acceptor transposon.
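The logic of transgene co-placement can be illustrated with a toy model. The Python sketch below represents a locus as an ordered list of elements and treats a recombinase as an operation that excises everything between two copies of its recognition site, leaving one site behind. The arrangement of sites around each transgene and the function name are assumptions made for illustration only; this is not a description of the actual vector design.

```python
from typing import List


def excise_between(locus: List[str], site: str) -> List[str]:
    """Remove the elements between the first two copies of `site`, keeping one copy."""
    positions = [i for i, element in enumerate(locus) if element == site]
    if len(positions) < 2:
        return list(locus)  # fewer than two sites: nothing can be excised
    first, second = positions[0], positions[1]
    return locus[:first + 1] + locus[second + 1:]


# Hypothetical co-placed locus: rescue fragment between FRT sites,
# mutant version between loxP sites.
locus = ["FRT", "rescue", "FRT", "loxP", "mutant", "loxP"]

after_flp = excise_between(locus, "FRT")    # FLP acts on FRT sites
after_cre = excise_between(locus, "loxP")   # Cre acts on loxP sites

print("After FLP:", after_flp)
print("After Cre:", after_cre)
```

Because both derivatives start from the same insertion, the rescue-only and mutant-only lines sit at an identical genomic position, which is the point of the method: any difference between them cannot be attributed to positional effects.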

Homologous recombination is the best method for in vivo gene targeting, since positional effects can be eliminated completely. Insertional gene targeting (Rong and Golic, 2000) and replacement gene targeting (Gong and Golic, 2003) are two alternative methods that have been developed. Insertional gene targeting results in the insertion of a target gene at a region of homology. Replacement gene targeting results in the replacement of endogenous homologous DNA sequences with exogenous DNA through a double reciprocal recombination between two stretches of homologous sequences. Site-specific zinc-finger-nuclease-stimulated gene targeting has been developed to further improve in vivo gene targeting (Bibikova et al., 2003; Beumer et al., 2006). The most widely used site-specific integration method in D. melanogaster employs the bacteriophage φC31 integrase. This integrase catalyzes the recombination between the phage attachment site (attP), previously integrated into the fly genome, and a bacterial attachment site (attB) present in the injected transgenic construct (Groth et al., 2004). A combination of different transgenic methods should aid D. melanogaster functional genomics studies aimed at determining the function of every gene in this insect.

In the reverse genetics approach, studies on the function of genes start with the gene sequence rather than with a mutant phenotype, which is the starting point in forward genetics. The gene sequence is used to alter gene function by employing a variety of methods, and the effect of the altered gene function on physiological and developmental processes of insects is then determined. Reverse genetics is an excellent complement to forward genetics, and some experiments are much easier to perform using reverse genetics. For example, RNA interference (RNAi), a reverse genetics method, is better suited than forward genetics for investigating the functions of all the members of a gene family. The availability of whole-genome sequences for a number of insects and the functioning of RNAi in these insects will keep scientists busy studying the functions of all genes in insects during the next few years.

DNA Microarray

In most cases, a group of functionally associated genes share similar expression patterns, which may be temporal, spatial, developmental, or physiological. For example, environmental changes and pathological conditions could alter global gene expression patterns. To understand and characterize the biological roles of an individual gene or a cluster of genes, a high-throughput quantitative method is needed to detect gene expression at the whole-genome level. The DNA microarray technique is one such method that has been developed for monitoring global gene expression patterns. Through robotic printing of thousands of DNA oligonucleotides onto a solid surface, one DNA microarray chip can accommodate more than 50,000 probes (unique DNA sequences). DNA microarrays utilize the principle of Southern blotting (Schena et al., 1995). First, fluorescently labeled probes are synthesized from RNA samples by reverse transcription; the probes are then hybridized to DNA microarrays that contain complementary DNA. After washing away the unbound probes, the intensity of the fluorescent signal for each spot is captured using a microarray scanner. DNA microarrays have been widely used in functional genomics research. In addition to their application in gene expression profiling, DNA microarrays can also be used to identify transcriptional or functional elements in the genome, or to identify single nucleotide polymorphisms (SNPs) among alleles within or between populations. The applications of DNA microarrays and various other types of arrays are listed in Table 2.
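Once spot intensities have been captured, a common first step is to compare the signal for each gene between two samples as a log2 ratio. The Python sketch below is a minimal illustration of that calculation. The gene names, intensity values, and the pseudocount are invented for the example, and real analyses would also include background subtraction and normalization, which are omitted here.

```python
import math


def log2_ratio(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """Log2 fold change between two spot intensities; the pseudocount avoids log of zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical scanner intensities: gene -> (treated sample, control sample).
intensities = {
    "geneA": (1599.0, 399.0),
    "geneB": (499.0, 499.0),
    "geneC": (249.0, 999.0),
}

ratios = {gene: log2_ratio(t, c) for gene, (t, c) in intensities.items()}

for gene, value in ratios.items():
    print(f"{gene}: log2 ratio = {value:+.2f}")
```

A positive value indicates higher signal in the treated sample, a negative value indicates lower signal, and a value near zero indicates no change. Working on a log2 scale makes a fourfold increase and a fourfold decrease symmetric around zero.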
