Grasshopper (Insects)

Genomics

The description of an organism in terms of its genome is termed genomics. Technological advances since the end of the 1990s have made it economically feasible to obtain the entire sequence of an organism’s euchromatic genome, and the price of whole-genome sequencing has steadily declined. This has generated an unprecedented quantity of biological data that has been at the forefront of the “genomic revolution.” The fruit fly, Drosophila melanogaster, was one of the first eukaryotes to have its genome sequenced, following on from the completion of the yeast and nematode genome projects. The results from genomic studies have had significant impacts on such disparate areas as the basis of genome organization, the ancestral relationships between major insect groups (phylogenetics), and insect behavior, and have improved our understanding of many individual genes and gene families. Since the publication of the Drosophila genome in the year 2000, more than 25 additional insect genomes have become available, with many more in progress (Table I). These new data have led to the development of new fields of research that seek to understand the information encoded in the nucleotide sequence, as well as how to go beyond nucleotides and examine whole genomes at the protein, metabolic, and systems-wide levels. It would be well beyond the scope of this article to give a full review of all the advances in this field, even those made in the last several years. Instead, we review the insect genomes that have been published to date, give an overview of the methods used to generate the nucleotide sequence of whole genomes, and outline a few commonly used techniques to analyze genomic data. Finally, we focus on some developments in the fruit fly and mosquito genomes, insects that have been at the forefront of genomic research in recent years.


PUBLISHED INSECT GENOMES

To date, 25 insect genomes have been published or are currently in advanced stages of being sequenced (Table I). The first published insect genome, that of D. melanogaster in the year 2000, was an obvious choice, as this species was (and remains) the most prominent model organism in molecular biology, with the most-studied eukaryotic genome to date. Following D. melanogaster, the choice of insect species for whole-genome sequencing was driven largely by their importance as vectors of human disease pathogens (mosquitoes, body louse, blood-sucking bug, and tsetse fly) or by their economic importance (honeybee, silk moth, flour beetle, parasitic wasp, and aphid). In addition, the genomes of 11 further species in the genus Drosophila were made available in 2007.
The results of these genome-sequencing efforts are publicly available on various online databases (Table I). Even insect genomes that have not been formally published are typically available for public download. There are several reasons for making these data so quickly and easily available. First, all insect genome-sequencing projects have been funded, in part or wholly, with public money, obligating the researchers to make the data widely available. Second, it is recognized that such data are of interest to a large number of researchers. Third, publication of a genome does not end with reporting the nucleotide sequences, but also includes “annotations” that indicate where on the nucleotide sequence genes begin and end, where gene regulatory sequences are located, where transposable elements may be found, and many other features. By making this information available early in the genome-sequencing process, individual researchers not directly involved with the project can help improve the annotation, either by correcting previous annotations or by submitting their own. Each genome-sequencing project has its own process for such submissions, which is typically detailed on the relevant online database.

OBTAINING THE NUCLEOTIDE SEQUENCE OF A GENOME

Any study of an insect’s genome fundamentally requires that the sequence of nucleotide bases that make up the chromosomes and the mitochondrial genome be determined. It should be noted that genome-sequencing projects typically concentrate on the euchromatic portion of the genome, which contains most of the genes, and ignore the highly repetitive heterochromatic portions (concentrated in the centromeric and telomeric regions of chromosomes). A notable exception is D. melanogaster, for which an effort to sequence all the heterochromatic regions has been underway since 2002. Although the technology used to obtain the nucleotide sequence of specific areas of the genome has been available since the early 1970s, it was the advent of whole-genome-sequencing technology in the 1990s that made it possible to determine the millions of base pairs that make up a typical insect genome (Table I). The modern technique of whole-genome sequencing was developed by The Institute for Genomic Research (renamed in 2007 as the J. Craig Venter Institute, JCVI). This technique relies on fragmenting the genome into small random sections, 300-1000 nucleotides long, obtaining the sequence of these sections, and finally assembling these sections into larger fragments (Fig. 1).

TABLE I

Completed and Ongoing Insect Genome Sequencing Projects

| Common name | Species name | Publication date | Sequenced genome size (millions of nucleotide bases) | Online database |
| --- | --- | --- | --- | --- |
| Fruit fly | Drosophila melanogaster | 2000 | 123 | FlyBase (http://flybase.bio.indiana.edu/) |
| | Drosophila pseudoobscura | 2005 | 139 | |
| | Drosophila virilis | 2007 | 206 | |
| | Drosophila ananassae | 2007 | 231 | |
| | Drosophila mojavensis | 2007 | 194 | |
| | Drosophila erecta | 2007 | 153 | |
| | Drosophila grimshawi | 2007 | 201 | |
| | Drosophila willistoni | 2007 | 236 | |
| | Drosophila persimilis | 2007 | 188 | |
| | Drosophila sechellia | 2007 | 167 | |
| | Drosophila yakuba | 2007 | 166 | |
| | Drosophila simulans | 2007 | 138 | |
| Mosquito | Anopheles gambiae | 2002 | 278 | VectorBase (http://www.vectorbase.org) |
| | Aedes aegypti | 2007 | 1,376 | |
| | Culex pipiens | Unpublished | 540 | |
| Honeybee | Apis mellifera | 2006 | 235 | HoneyBee Genome project at Baylor College (http://www.hgsc.bcm.tmc.edu/projects/honeybee/) |
| Silk moth | Bombyx mori | 2004 | 530 | Beijing Genomics Institute (http://silkworm.genomics.org.cn/) |
| Flour beetle | Tribolium castaneum | Unpublished | 203 | BeetleBase (http://www.bioinformatics.ksu.edu/BeetleBase/) |
| Parasitic wasp | Nasonia vitripennis | Unpublished | 250 | Nasonia Genome Project (http://www.hgsc.bcm.tmc.edu/projects/nasonia/) |
| | Nasonia giraulti | Unpublished | N/A | |
| | Nasonia longicornis | Unpublished | N/A | |
| Blood-sucking bug | Rhodnius prolixus | Unpublished | N/A | N/A |
| Pea aphid | Acyrthosiphon pisum | Unpublished | N/A | N/A |
| Body louse | Pediculus humanus | Unpublished | 106 | VectorBase (http://www.vectorbase.org) |
| Tsetse fly | Glossina morsitans morsitans | Unpublished | N/A | N/A |

FIGURE 1 Summary of large-scale sequencing technique used for insect genomes. Insect colonies (A) are collected and their DNA is isolated, purified, and fragmented into 300-1000 nucleotide long fragments (B). Because too little DNA is typically collected from insect samples, copies of each fragment are made by first inserting the DNA into a bacterial clone (C) and then replicating them by bacterial cell division (D). The nucleotide sequence of each clone is then determined (E) and the resulting sequences are assembled by matching overlapping ends (F).
The two main advantages of this method over earlier techniques, which concentrated on sequencing specific regions of the chromosome, are speed and relatively low cost. However, the biggest weakness of the method is the assembly of the small random sections into larger fragments (section F in Fig. 1). This assembly relies on matching nucleotide sequences with overlapping ends to each other to form larger fragments. When two fragments have the same end sequences (which occurs often when a fragment comes from a region of the genome where the same nucleotide sequence is repeated multiple times), it becomes difficult to determine which two fragments should be joined. This problem can be overcome to some extent by increasing the number of random fragments sequenced. Typically, enough fragments are sequenced from an insect genome so that any one nucleotide position is present, on average, on at least four or five DNA fragments (referred to as 4X or 5X coverage). A second way to try to overcome the assembly problems is to clone large sections of the genome (tens of thousands of nucleotides long) into “Bacterial Artificial Chromosomes” (BAC clones), which are engineered so that DNA inserted into them can easily be sequenced. The resulting long sequences can then be used to guide the assembly of smaller fragments. Unfortunately, even with both the high numbers of random fragments and BAC clones, assembly problems remain significant for insect genomes beyond Drosophila and Anopheles gambiae due to the large amounts of repeated sequences and, in many cases, large overall genome sizes. The result is that although some insect genomes have been assembled to the level of chromosomes (D. melanogaster and A. gambiae), most other insect genomes are still too fragmented to be put together into large contiguous pieces. For example, the genome of the yellow fever mosquito Aedes aegypti, which is riddled with high numbers of repeated fragments, is (as of this writing) still a collection of 1257 fragments (called scaffolds) that have yet to all be placed on the three chromosomes of this important vector species.
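The coverage figure and the overlap-matching step described above can be illustrated with a minimal Python sketch. It is not how any particular genome project assembled its data: the function names (`expected_coverage`, `overlap_length`, `greedy_assemble`), the minimum overlap of three bases, and the toy fragments are all assumptions made for illustration, and real assembly software must also cope with sequencing errors, both DNA strands, and repeats.

```python
def expected_coverage(num_reads, read_length, genome_size):
    """Average number of sequenced fragments covering each base (5.0 means 5X)."""
    return num_reads * read_length / genome_size


def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly join the two fragments that share the longest overlapping end."""
    fragments = list(reads)
    while len(fragments) > 1:
        best = (0, None, None)
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the assembly stays fragmented
        merged = fragments[i] + fragments[j][size:]
        fragments = [f for k, f in enumerate(fragments) if k not in (i, j)]
        fragments.append(merged)
    return fragments


# One million 500-base fragments from a 100-million-base genome give 5X coverage.
print(expected_coverage(1_000_000, 500, 100_000_000))      # 5.0
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))  # ['ATGGCGTACCTTA']
```

When a repeated sequence gives several fragments the same end sequence, more than one join scores equally well in `greedy_assemble`, and the sketch simply takes the first it finds. That arbitrary choice is the ambiguity that higher coverage and BAC clones are meant to reduce.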
It is evident from the sheer number of fragments to be matched that the work of genome assembly is only possible with the aid of specialized computer software. Assembly software such as ARACHNE is publicly available, but the memory and processing time required to assemble an entire insect genome are beyond the capability of today’s typical desktop computers. Instead, assembly of whole genomes is typically left to institutions with the computing capacity to process large amounts of data quickly. Often, the sequencing and genome assembly will be performed in the same facility, such as JCVI or the Broad Institute. However, smaller subsections of a genome may be assembled by individual researchers. This can be an important way for an individual researcher to verify that the section of the genome he or she is concerned with was correctly assembled. Both genome assembly and sequencing technology are rapidly changing fields. The genome-sequencing outline we have provided here is applicable to all the currently sequenced genomes, but advances such as the rapid genome-sequencing technique developed by the 454 Life Sciences company will undoubtedly make their way into the next generation of insect genome-sequencing projects.

Obtaining Additional Genome Data Sets as a Part of a Genome Project

In addition to the nucleotide sequence, other data sets are often collected as part of a whole-genome-sequencing project. One of the most commonly collected data sets is a set of expressed sequence tags (ESTs), which are used to indicate which parts of the genome may code for genes. ESTs are short nucleotide sequence fragments (500-800 nucleotides long) derived from cloned messenger RNAs (mRNAs). Because they are derived from mRNAs, EST sequences represent portions of the genome that are transcribed by RNA polymerase and thus could potentially be translated into a protein product. Only a small percentage of the transcriptome (the collection of all mRNA sequences from a genome) is translated into protein. Nevertheless, EST data sets are essential to accurately map the location of genes on the genome (detailed below). EST data sets are relatively inexpensive to generate and are typically available for each insect genome either on the online database web site or on the larger EST database maintained by the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/dbEST/). Because not all genes are expressed continuously during development or under the same environmental conditions, mRNAs are typically collected from insects at various life stages (e.g., larva, pupa, and adult), under different environmental conditions (e.g., mosquitoes before and after a blood meal), and even from individual tissues (e.g., midgut tissue from the silk moth or mosquitoes). Because of their low cost, small EST data sets are often available for an insect before the genomic sequence, but larger and more representative data sets are collected in conjunction with a whole-genome-sequencing effort.
In addition to ESTs, insect mitochondrial genomes are usually assembled as well, since they can easily be determined from the whole-genome-sequencing output. Sometimes additional sequences that are specific to an organism are also collected. For example, the strain of A. gambiae chosen for whole-genome sequencing was found to be a mixture of two molecular forms of this species. Additional sequencing of the two forms (called the M and S forms) is currently underway to tease these sequences apart.

MAPPING THE LOCATION OF GENETIC FEATURES ON AN INSECT GENOME

Having only the nucleotide sequence of an insect’s genome is of very limited value to most researchers. What is required is a “map” of important features of a genome, such as the location of genes, controlling elements, transposable elements, etc. Details of how all these features are mapped are beyond the scope of this article, but we provide a rough outline here. There are typically two phases to generating such maps. First, computer algorithms are used to identify genome features (this phase is sometimes called a “gene build”). Second, the results of the computer search are reviewed (typically by researchers with expertise in a particular set of genes or genome features) and refined.
A number of computer algorithms have been developed for the first mapping phase (the “gene build”). These are incorporated into programs such as RECON, REPEATSCOUT, and others that try to assemble into a single processing queue the knowledge of previously determined gene sequences, gene models, transposable element sequences, information from EST data sets, and other kinds of data. Taking all this into account, these programs try to make informed predictions about the location on the genome nucleotide sequence of all the mappable features. The variety of algorithms available for this task stems from the fact that no one algorithm is best suited for predicting the location of the wide variety of genomic features. For example, the output of one program may be conservative and accurately predict the location of some genes, but at the expense of completely missing others, while another program may be too liberal, finding the missing genes but also reporting false positives (i.e., mistakenly identifying a gene in a location where there is none). Because of this, it has become common practice to use several algorithms for the same task and then to merge the results into a single set of features (termed a “gene merge”); this is the method that was used for all three mosquito genomes. The generation of “gene builds” is very computer intensive and beyond the means of individual researchers. Instead, these are often conducted by specialized teams at genome-sequencing centers or by other specialized groups (such as the European Bioinformatics Institute).
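The idea behind a “gene merge” can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure used for any published genome: each prediction is reduced to a (start, end) interval, overlapping predictions are treated as the same feature, and a merged feature is kept only when at least `min_support` programs predicted it. The names `merge_predictions` and `min_support`, the program labels, and the coordinates are hypothetical.

```python
def merge_predictions(predictions_by_program, min_support=2):
    """Combine gene intervals from several programs into one consensus set.

    `predictions_by_program` maps a program label to a list of (start, end)
    intervals. Overlapping intervals are merged into one feature, and a
    feature is kept only if at least `min_support` programs contributed to it.
    """
    tagged = sorted(
        (start, end, program)
        for program, intervals in predictions_by_program.items()
        for start, end in intervals
    )
    merged = []  # each entry: [start, end, set of supporting programs]
    for start, end, program in tagged:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous feature: extend it and record the support.
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2].add(program)
        else:
            merged.append([start, end, {program}])
    return [(s, e, sorted(p)) for s, e, p in merged if len(p) >= min_support]


predictions = {
    "conservative": [(100, 500), (2000, 2600)],
    "liberal": [(120, 480), (900, 1100), (1950, 2700)],
}
print(merge_predictions(predictions))
# [(100, 500, ['conservative', 'liberal']), (1950, 2700, ['conservative', 'liberal'])]
```

Requiring support from two programs drops the interval at 900-1100 that only the liberal program reported, whereas `min_support=1` would keep it. That setting expresses the trade-off between missed genes and false positives described above.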
After the initial mapping by computer algorithms, the second mapping phase begins. In this phase, researchers with expertise in a particular gene family or in features such as transposable elements review the computer-generated results and improve upon them. For some genomes (such as those of Drosophila and the mosquitoes), “community representatives” are assigned to review any suggested modifications of a genome annotation and to ensure that the proposed changes become publicly available after they have been approved. While this system ensures that anyone can contribute to the annotation of a genome, the vast majority of genes in a typical insect genome will not be reviewed by experts. A notable exception is the genome of D. melanogaster, for which a comprehensive effort has been undertaken to manually review each gene; a similar effort is currently underway for the three mosquito genomes.
The results of the computer-generated and human-curated annotations are displayed in one of two basic formats, and are publicly available over the internet. At the NCBI website, a separate record for each gene (or any other genome feature) is generated, providing all the available information in text form, in the equivalent of an electronic card catalog. Other sites, such as ENSEMBL (http://www.ensembl.org), display the information in a graphical interface, showing the location of individual features on the genome as well as nearby features. Often more than one feature will be mapped to the same physical location on the genome. This may come about for several reasons. First, it may represent alternate splicing of the same gene. Second, if a change has been made to a feature, the old version will typically not be removed. Third, some features do overlap in nature, such as a transposable element sitting inside a gene intron. The resulting array of annotations often yields a genome map that may be confusing at first glance, but that reflects some of the real complexity of an insect genome.

Drosophila Genomics

The genome of D. melanogaster is by far the most studied and best understood of all available insect genomes. Even though this insect has been at the forefront of genetic research since the days of Thomas Hunt Morgan at the start of the last century, less than 20% of its genes had been identified prior to the publication of its genome in the year 2000. Many specialized databases exist to assemble genomic data derived from Drosophila. The two most important ones are FlyBase (http://flybase.bio.indiana.edu/) and the Berkeley Drosophila Genome Project (http://www.fruitfly.org/). These data repositories attempt to assemble, in one place, all of the most important information related to fruit fly genomics. The accumulation of such vast amounts of data has made possible the development of several new fields of research. We outline two of them here. Neither of these fields is exclusive to insect genomics, but both have seen major recent developments using the fruit fly genome.

FUNCTIONAL GENOMICS

The aim of functional genomics is to describe and understand the pattern of gene expression. It has long been established that gene expression is typically influenced by the expression of not one, but many, genes. Before the entire set of Drosophila genes became known, the study of gene expression relied heavily on the generation of fruit fly mutants for specific genes and on deducing the function of these genes from observed physical or behavioral differences compared to wild-type flies. While generating mutant Drosophila flies is far easier than for any other insect (mainly because the methods have been so thoroughly developed for the flies), it remains a time-consuming and labor-intensive task. Furthermore, the effects of some mutations cannot be observed, either because they yield no observable change or because they are lethal, killing the fly before the mutation can be observed. The sequencing of the Drosophila genome revolutionized the study of insect gene expression by making microarray assays possible for the entire set of Drosophila genes.
DNA microarray technology was first developed in the late 1980s and significantly improved in the 1990s. Its purpose is to examine whether a particular gene region is transcribed under specific environmental conditions and, if it is, how high or low the level of transcription is. Whether a genetic region is transcribed or not (as well as the level of transcription) is taken as an indication that the gene may be translated into a protein product (it is important to remember that only a fraction of all the mRNAs are eventually translated).
The technology behind DNA microarrays is similar to that of standard nucleic acid hybridizations, but can be accomplished on a much bigger scale with the introduction of “gene chips.” Gene chips are solid surfaces onto which a series of nucleotide sequences have been bonded. The physical location and nature of each bonded nucleotide sequence (a bonded nucleotide sequence is called a “probe”) is recorded. Probes are typically arranged in a grid pattern. There is no constraint on the type of nucleotide sequence that can make a probe, but some of the most-used gene chips are those where each probe contains the entire sequence (or at least a substantial portion) of a single gene. In this way, thousands of different gene sequences can be placed on one or more gene chips, making it possible to have all the known Drosophila genes (over 13,500 genes) available in a single assay. Such genome-wide gene chips are commercially available not only for Drosophila but also for the malaria mosquito (A. gambiae), and there are plans to make a similar chip available for the yellow fever mosquito (A. aegypti) and the southern house mosquito (Culex pipiens quinquefasciatus).
To detect expression using a gene chip, a solution of cDNA labeled with a fluorescent dye is washed over the probes (cDNA, that is, DNA reverse transcribed from mRNA, is preferred for this assay because it is far more stable than RNA, which can degrade rapidly). Both the cDNA and probes are denatured (i.e., the two DNA strands of the double helix are separated from one another) during the assay. The assay is conducted under such conditions that if the cDNA and probes have complementary sequences they will become attached to one another, reforming the double helix, a process referred to in this context as hybridization. After hybridization, any unattached cDNA is washed away. The chip is then illuminated with ultraviolet light, which causes the dye attached to the hybridized cDNAs to fluoresce and become visible. Under the right assay conditions, not only the presence but also the amount of hybridized DNA can be measured. The end result is a grid of probes that are either dark (no hybridization, so these genes did not have matching cDNAs in the assay) or illuminated with an intensity proportional to the amount of matching cDNA in the assay. Different colored dyes enable genes that are up- or down-regulated to be identified.
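How up- and down-regulated genes might be read off such a grid can be sketched as follows. This is a minimal illustration, assuming an experiment with two dyes in which each probe yields one intensity for the sample of interest and one for a reference sample. The function name `classify_probes`, the two-fold threshold, the probe names, and the intensity values are assumptions made for the example, and real analyses include background correction and normalization steps that are omitted here.

```python
import math


def classify_probes(sample, reference, fold_threshold=2.0):
    """Label each probe as up, down, or unchanged from two dye intensities.

    `sample` and `reference` map probe names to fluorescence intensities.
    A probe is called "up" or "down" when the sample-to-reference ratio
    differs by at least `fold_threshold` in either direction.
    """
    cutoff = math.log2(fold_threshold)
    calls = {}
    for probe, sample_signal in sample.items():
        reference_signal = reference[probe]
        if sample_signal <= 0 or reference_signal <= 0:
            calls[probe] = "no signal"  # dark probe in at least one channel
            continue
        log_ratio = math.log2(sample_signal / reference_signal)
        if log_ratio >= cutoff:
            calls[probe] = "up"
        elif log_ratio <= -cutoff:
            calls[probe] = "down"
        else:
            calls[probe] = "unchanged"
    return calls


sample = {"geneA": 800.0, "geneB": 100.0, "geneC": 300.0, "geneD": 0.0}
reference = {"geneA": 200.0, "geneB": 400.0, "geneC": 280.0, "geneD": 150.0}
print(classify_probes(sample, reference))
# {'geneA': 'up', 'geneB': 'down', 'geneC': 'unchanged', 'geneD': 'no signal'}
```

Working with the logarithm of the ratio treats a four-fold increase and a four-fold decrease symmetrically (+2 and -2), which is why log ratios are a common choice for this kind of comparison.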
An example of how this type of assay has significantly improved our understanding of gene function can be found in recent studies of Drosophila circadian rhythms. Ueda and colleagues reared flies under light conditions of different intensity and collected sample flies at specific time intervals. At each time point, 100 fly heads were collected and total RNA was extracted. After cDNAs were created from these RNA extractions, the cDNA extracts were washed over a gene chip containing the entire set of Drosophila genes. By analyzing which genes were transcribed in different samples, 712 genes were found to have fluctuating mRNA levels under varying light conditions, leading these researchers to significantly expand the number of genes known to be involved in Drosophila circadian rhythms. To further elucidate which of these candidate genes were most directly involved in regulating circadian rhythm, this same group subsequently applied a second technique, RNA interference (RNAi). RNAi is a laboratory technique developed to reduce the expression of targeted genes through the generation of double-stranded RNAs. By modifying standard RNAi techniques, they were able to significantly reduce the expression of 133 genes in a tissue-specific manner. Based on these results, Matsumoto et al. were able to identify five new genes directly involved in Drosophila circadian rhythm regulation. As this example demonstrates, advances in existing fields of research can occur rapidly with the introduction of genomic techniques.

COMPARATIVE GENOMICS

Although studies of the Drosophila genome on its own have yielded significant contributions to our understanding of this model organism, it is the comparison of one genome to another that may hold the greatest promise for significant advances in the years to come. Comparative genomics is the study of relationships between genomes, be they of different species or of strains of the same species. Prior to whole-genome sequencing, this field was severely limited both by the types of data available for comparison and by the low number of completed genome projects. While the number of insect species with published whole-genome sequences is still infinitesimally small in comparison to the number of described insect species, the amount of information available is now sufficient to enable meaningful comparisons across species to be made at the genomic level. Furthermore, comparative genomics is not limited to studies within insects, but can be very productively applied to comparisons of distantly related organisms. For example, it has become clear that a great number of genes shared by fruit flies and humans are not only homologous (i.e., derived from the same ancestral gene) but perform similar functions in both organisms. Studies have found that a number of human genetic disease loci have homologs in Drosophila showing similar patterns of expression. This observation has made possible the use of Drosophila as a model organism for the study of certain human diseases. We will not go any further into the advances made by fruit fly to human comparisons, but this example serves to demonstrate the great potential of comparative genomics.
One of the most exciting developments in comparative genomics is the recent publication of 12 entire Drosophila genomes. For researchers interested in the mechanisms of evolution in animals, this provides a unique opportunity to observe which genes (or genome regions in general) have evolved between species and which have remained unchanged. For example, these 12 genomes show a remarkable amount of variation between species in both the types and the amount of transposable elements (small sections of DNA sequence that can catalyze their own movement to different locations in the genome). This observation lends some support to the hypothesis that transposable elements may have a role to play in insect speciation mechanisms. Comparison of these genomes also shows that much Drosophila speciation is associated with chromosomal inversions (i.e., rearrangement of chromosomal segments). What makes this data set such a powerful tool for the study of evolutionary mechanisms is that the ancestral relationship between these Drosophila species (the phylogeny) is well established. This allows researchers not only to observe what changes have occurred between species, but also in what sequence they occurred (i.e., what the ancestral and derived states are), and even to put time boundaries on when some changes occurred, giving an estimate of the absolute time a change took. This is a level of detail previously reserved for bacterial and yeast genomes. The success of this approach has led to calls for its expansion to other insects with published genomes, such as mosquitoes.

MOSQUITO GENOMICS

As vectors of some of the deadliest human diseases, mosquitoes are the second-most-studied insects using genomic tools, after fruit flies. One of the challenges of mosquito genomes is their large size compared to that of Drosophila. For example, the A. aegypti genome is more than ten times the size of the D. melanogaster genome (1,376 versus 123 million nucleotide bases; Table I). This increase in size is not correlated with a similar increase in gene number, but appears to be mainly due to an increase in the number of repeat elements in the mosquito genome. Why so many more repeated elements accumulate in mosquito genomes is a subject of active research. Although fruit flies are of little economic importance and are studied mainly as “model” insects, much of the focus of mosquito genome research is centered on finding ways to prevent human disease transmission. Because of this, functional genomic studies of mosquitoes have tended to focus on gene families that could be targets for vector control strategies, such as immunity genes or genes involved in host-seeking behavior (e.g., genes involved in olfaction).
A. gambiae, the main vector of malaria in sub-Saharan Africa, was the first mosquito to have its entire genome sequenced. An immediate benefit of the genome sequence was the dramatic increase in potential microsatellite markers. Microsatellites are used by population geneticists to study gene flow between field mosquito populations in order to understand patterns of malaria transmission in the field. A number of potential vector control genes have been identified, including 79 odorant receptor genes and 76 gustatory receptor genes. Comparative genomic studies demonstrated that the majority of immunity genes are similar in function between Drosophila and A. gambiae. The combination of such studies with RNAi techniques (discussed above) has been very fruitful in elucidating the pathways by which these immunity genes function in this mosquito. Another exciting development in mosquito genomics was the sequencing of Plasmodium falciparum, the unicellular protist that can lead to the development of human malaria when it is transmitted through a mosquito host. Much of the research involving the Plasmodium and Anopheles genomes has centered on trying to identify which genes are involved in the mosquito’s suppression of Plasmodium infection, and how Plasmodium is able to evade the mosquito immune response. While the other two mosquito genomes, A. aegypti (the primary vector of dengue and yellow fever) and C. pipiens (a vector of the West Nile virus), have not yet been examined in as much detail as A. gambiae, the opportunity to compare genetic differences and similarities on a genome scale between the three mosquito species promises to be very fruitful in the effort to understand disease transmission by mosquitoes.

CONCLUSIONS

The field of insect genomics has revolutionized many existing areas of insect molecular biology and genetics. The ability to identify many candidate genes for a specific function by analysis of an insect’s genome has significantly increased the speed with which genes have been linked to specific functions. The ability to compare multiple genes across multiple genomes has yielded, and will continue to yield, insights into the way genes and genomes evolve. The development of microarrays has led to the elucidation of gene function at much higher rates than previously possible. These are advances that would have taken many additional years of research using conventional genetic techniques. In this article, we have covered only DNA sequences, but proteomics (the large-scale study of protein structure and function) has also seen advances using insect models. However, this wealth of information also comes at a price. So much data has been generated that it has become increasingly difficult to analyze it appropriately. The difficulty is not the availability of the data, as most of the databases are open to all researchers and even to nonacademics. Instead, the problem is to find which of the myriad databases contains the required information and, once it is found, to retrieve it. Several publicly accessible databases attempt to centralize genomic data, the best known and largest of which is GenBank. However, retrieving large amounts of genomic data can be a time-consuming effort, particularly if the data is spread across multiple databases. In an effort to standardize the retrieval of genomic data, many databases have adopted the Generic Model Organism Database (GMOD) model, a standard for classification of almost any biological information. Databases that have adopted this model benefit from a large collection of open source software that can be used to retrieve stored data.
Until the advent of whole-genome sequencing, molecular biology and computer science rarely overlapped. Looking into the future, it appears likely that the next generation of molecular biology researchers will have to be as comfortable manipulating database queries as they are using micropipetters at the laboratory bench.
