Insect Genomics Part 3

Global gene expression analysis (transcriptome analysis)

DNA microarray fabrication. The DNA microarrays used for global gene expression analysis usually contain tens of thousands of probes which cover all the predicted genes in a genome, or sequences representing transcribed regions, also called expressed sequence tags (ESTs). For example, the Affymetrix GeneChip® Drosophila Genome 2.0 Array contains over 500,000 data points representing 18,500 transcripts and various SNPs (Affymetrix technical data sheets). DNA microarrays can be prepared by various methods, including photolithography, ink-jet technology, and spotted array technology. Photolithography and ink-jet technologies are used for fabricating so-called oligonucleotide microarrays, which are made by synthesizing or printing short oligonucleotide sequences (25-mer in Affymetrix arrays or 60-mer in Agilent arrays) directly onto a solid array surface. The photolithography method is used by Affymetrix and NimbleGen, while the ink-jet printing method is used by Agilent. Typically, multiple probes per gene are used in order to achieve precise estimation of gene expression. Long oligonucleotides have better hybridization specificities than short ones, although short oligonucleotides can be printed at a higher density and synthesized at lower cost. In contrast, spotted microarrays are made by synthesizing probes prior to deposition onto the array surface. The probes used for spotted microarrays can be oligonucleotides, cDNA, or PCR products. Because of its relatively low cost and flexibility, spotted microarray technology has been widely used to produce custom arrays in many academic laboratories and facilities. However, spotted microarrays are less uniform and have a lower probe density than oligonucleotide arrays. As the cost of custom commercial arrays such as Agilent Custom Gene Expression Microarrays (eArray) has decreased, the use of spotted microarrays is decreasing as well.

Table 2 List of Applications of DNA Microarray

Application	Description	Type of microarray
Gene expression	Measuring global gene expression patterns under various biological conditions	Expression array
ChIP-on-chip	Identifying transcriptional or functional elements at a whole-genome level	Tiling array
DamID	Genome-wide scanning of adenosine methylation events; analogous to ChIP-on-chip	DNA methylation array
miRNA profiling	Genome-wide detection of the expression of miRNAs (small non-coding RNAs)	miRNA array
SNP detection	Detecting polymorphisms within a population	SNP array
Pathogen and virus detection	Low-density DNA microarray for the identification of viruses and pathogens	Virus Chip, FluChip

Target preparation and hybridization. Total RNA or mRNA is isolated from experimental samples using commercial TRIzol reagent or RNA isolation and purification kits. Total RNA (1 μg to 15 μg) or mRNA (0.2 μg to 2 μg) is reverse transcribed into first-strand cDNA. For smaller amounts of total starting RNA (10 ng to 100 ng), Affymetrix offers a two-cycle target labeling method to obtain sufficient amounts of labeled targets for DNA hybridization. The cDNAs are then labeled and hybridized to spotted or oligonucleotide microarrays. In oligonucleotide microarrays, one mRNA sample labeled with one fluorescent dye is analyzed on a single channel. Alternatively, two different fluorescent dyes, such as Cy3 and Cy5, can be used to determine gene expression changes between two different experimental conditions.

Data analysis. Although the data analysis methods among commercial microarrays vary, the basic concepts are similar. After hybridization, the fluorescence images are captured by a microarray scanner. The fluorescence intensity data are then corrected for background (noise), which may result from nonspecific hybridization or autofluorescence. In two-channel arrays, the fluorescence intensity ratio between the two dyes is calculated and adjusted. If data from different arrays or hybridizations are to be compared, they need to be normalized before further analysis.
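The preprocessing steps described above can be sketched for a two-channel array as follows. This is a minimal illustration, not the procedure of any particular vendor: the function names, the intensity floor, and the choice of median centering as the normalization method are all assumptions made for the example.

```python
import math
from statistics import median


def log_ratios(cy5, cy3, bg5, bg3, floor=1.0):
    """Background-subtract each channel and return per-spot log2(Cy5/Cy3).

    `floor` is an assumed lower bound that keeps corrected intensities
    positive so the logarithm is always defined.
    """
    ratios = []
    for red_raw, green_raw, red_bg, green_bg in zip(cy5, cy3, bg5, bg3):
        red = max(red_raw - red_bg, floor)
        green = max(green_raw - green_bg, floor)
        ratios.append(math.log2(red / green))
    return ratios


def median_center(values):
    """Shift the log ratios so that the median of the array is zero."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical intensities for four spots on one array.
cy5 = [1100.0, 2100.0, 500.0, 4100.0]
cy3 = [600.0, 1100.0, 300.0, 1100.0]
bg5 = [100.0] * 4
bg3 = [100.0] * 4

raw = log_ratios(cy5, cy3, bg5, bg3)
normalized = median_center(raw)
print(normalized)
```

Median centering assumes that most genes are not differentially expressed, so the typical log ratio on an array should be zero; other normalization schemes make different assumptions.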

After normalization, various statistical analysis methods can be applied to identify differentially expressed genes between two treatments. Usually, a t-test is used for comparing the means of two sample populations, while ANOVA (analysis of variance) is applied for comparing multiple sets of samples or treatments to obtain more accurate variance estimates. Since many genes are tested for statistical differences, multiple test corrections, such as the Bonferroni correction and the Benjamini and Hochberg false discovery rate (FDR) (Benjamini and Hochberg, 1995), are applied to adjust the P-value and control the occurrence of false positives. The Bonferroni correction is a very stringent method that uses α/n as the threshold P-value for each test, where α is the desired overall significance level and n is the number of tests (the number of genes). In contrast, the Benjamini and Hochberg FDR is less stringent, and its false-negative rate is lower. Various statistical analysis programs are now available from either commercial microarray providers or open source websites. These include GeneSpring from Silicon Genetics (acquired by Agilent in 2004) and Significance Analysis of Microarrays (SAM) (Tusher et al., 2001).

Besides differential expression analysis, genes with similar expression patterns can be grouped into one or more clusters using hierarchical clustering methods. Hierarchical clustering analysis helps to visualize gene expression patterns and identify relationships between functionally associated genes (Eisen et al., 1998). In addition, programs such as Gene Set Enrichment Analysis (GSEA) are used to determine whether there is a statistically significant, coordinated difference between control and treatment samples for a predefined set of genes that are involved in a similar biological process (Subramanian et al., 2005). Unlike traditional microarray analyses at the single-gene level, GSEA addresses the situation where the fold change between control and treatment samples is small, but there is a concordant difference in the representation of functionally related genes. Several published microarray datasets have been deposited in various online databases, including Gene Expression Omnibus (GEO) at NCBI, ArrayExpress at the European Bioinformatics Institute, and Stanford Genomic Resource at Stanford University. A list of microarray analysis tools and databases is shown in Table 3.
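The difference in stringency between the two multiple-testing corrections can be illustrated with a short sketch. The function names and the example P-values are hypothetical; the Bonferroni rule uses the α/n threshold described above, and the second function follows the step-up procedure of Benjamini and Hochberg (1995).

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose P-value is at or below alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


def benjamini_hochberg_significant(p_values, alpha=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up procedure.

    P-values are ranked in ascending order; the largest rank k with
    p_(k) <= (k / n) * alpha is found, and all tests ranked 1..k are rejected.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * alpha:
            cutoff_rank = rank
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical P-values for eight genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

bonf = bonferroni_significant(p_values)
bh = benjamini_hochberg_significant(p_values)
print(sum(bonf), "gene(s) pass Bonferroni;", sum(bh), "gene(s) pass Benjamini-Hochberg")
```

With these example values the Bonferroni threshold is 0.05/8 = 0.00625, so only the first gene passes, whereas the Benjamini and Hochberg procedure also accepts the second gene (0.008 ≤ 2/8 × 0.05).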

Applications. The primary goal of developing gene expression microarray technology is to monitor differentially expressed genes at the whole-genome level. Therefore, microarray technology has been used to study the molecular basis of pesticide resistance (Djouaka et al., 2008; Zhu et al., 2010) (Figure 3), insect-plant interactions (Held et al., 2004), insect host-parasitoid associations (Lawniczak and Begun, 2004; Barat-Houari et al., 2006; Mahadav et al., 2008; Kankare et al., 2010), insect behavior (McDonald and Rosbash, 2001; Etter and Ramaswami, 2002; Dierick and Greenspan, 2006; Adams et al., 2008; Kocher et al., 2008), development and reproduction (White et al., 1999; Kawasaki et al., 2004; Dana et al., 2005; Kijimoto et al., 2009; Bai and Palli, 2010; Parthasarathy et al., 2010a, 2010b), etc.

Table 3 List of Microarray Data Analysis Tools and Microarray Databases

Statistical Analysis Programs
GeneSpring	http://www.agilent.com/
SAM	http://www-stat.stanford.edu/~tibs/SAM/
Bioconductor	http://www.bioconductor.org/
Partek	http://www.partek.com/
Cluster and Pathway Analysis Tools
Cluster and TreeView	http://rana.lbl.gov/EisenSoftware.htm
Cluster 3.0	http://bonsai.hgc.jp/~mdehoon/software/cluster/
Java TreeView	http://jtreeview.sourceforge.net/
Gene Set Enrichment Analysis (GSEA)	www.broadinstitute.org/gsea/
Gene Set Analysis (GSA)	http://www-stat.stanford.edu/~tibs/GSA/
Genepattern	http://www.broadinstitute.org/cancer/software/genepattern/
Genecruiser	http://genecruiser.broadinstitute.org/genecruiser3/
Advanced Pathway Painter	http://pathway.painter.gsa-online.de/
Microarray Databases
Gene Expression Omnibus	http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress Archive	http://www.ebi.ac.uk/microarray-as/ae/
Stanford Genomic Resources	http://genome-www.stanford.edu/
Arraytrack	http://www.fda.gov/ScienceResearch/BioinformaticsTools/Arraytrack/
Genevestigator	https://www.genevestigator.com/gv/index.jsp

Understanding the mechanisms of pesticide resistance is critical for prolonging the life of existing insecticides, designing novel pest control reagents, and improving control strategies. As a result, several laboratories have begun using microarrays to identify genes responsible for insecticide resistance. For example, using a custom microarray, one cytochrome P450 gene, CYP6BQ9, was identified as responsible for the majority of deltamethrin resistance in T. castaneum (Zhu et al., 2010) (Figure 3). Another microarray study discovered that two cytochrome P450 genes, CYP6P3 and CYP6M2, are upregulated in multiple pyrethroid-resistant Anopheles gambiae populations collected in Southern Benin and Nigeria (Djouaka et al., 2008). A global view of tissue-specific gene expression profiling has been reported in Drosophila melanogaster (Chintapalli et al., 2007). This study identified many genes that are uniquely expressed in specific fly tissues, and provided useful information for understanding the tissue-specific functions of these candidate genes.

Biological processes and cellular functions are rarely regulated by only one or a few genes. Therefore, monitoring the expression changes of a group of genes under different biological conditions could provide useful insights into biological processes and cellular functions. Microarrays have been applied to detect gene expression patterns during insect embryonic development (Furlong et al., 2001; Stathopoulos et al., 2002; Tomancak et al., 2002; Altenhein et al., 2006; Sandmann et al., 2007) and metamorphosis (White et al., 1999; Butler et al., 2003), under various nutrient conditions (Zinke et al., 2002; Fujikawa et al., 2009), with aging (Weindruch et al., 2001; Pletcher et al., 2002; Terry et al., 2006; Pan et al., 2007), and in many other circumstances.

In combination with newly developed statistical and bioinformatics methods, and gene ontology and signaling pathway databases, microarray technology has also been applied to identify a signaling pathway or a specific cellular function that is altered under various biological conditions (Subramanian et al., 2005). With these approaches, it is possible to discover the interactions between individual pathways and obtain a global network view (Costello et al., 2009; Avet-Rochex et al., 2010).

DNA-protein interaction (chromatin immunoprecipitation)

Chromatin immunoprecipitation (ChIP) was developed in the late 1980s (Hebbes et al., 1988) and has been widely applied to the study of protein-DNA interactions in vivo. In particular, transcription factors, histone modifications, and DNA replication-related proteins can be studied using ChIP. By combining ChIP with DNA microarray technology, a process typically called ChIP-on-chip, all the possible DNA-binding sites of a protein of interest throughout the genome can be examined. ChIP-on-chip technology first appeared in 2000 in studies of DNA-binding proteins in the budding yeast, Saccharomyces cerevisiae (Ren et al., 2000; Iyer et al., 2001). With the availability of high-density oligonucleotide arrays which contain short sequences representing non-coding regions or entire genomes, ChIP-on-chip has also been applied to the global identification of transcriptional regulatory networks in various organisms.

Figure 3 Application of microarray and RNA interference technologies to identify and fight insecticide resistance.

(A) The V plot of differentially expressed genes identified by microarrays. Fold suppression or overexpression of genes in the QTC279 strain, compared with their levels in the Lab-S strain, was plotted against the P values of the t-test. The horizontal bar in the plot shows the nominal significance level of 0.001. The vertical bars mark the genes that differ by at least 2.0-fold. Three genes identified by the Bonferroni multiple-testing correction as differentially expressed between resistant and susceptible strains are shown.

(B) Injection of CYP6BQ9 dsRNA into Tribolium castaneum QTC279 beetles reduces CYP6BQ9 mRNA levels. The mRNA levels of CYP6BQ9 were quantified by qRT-PCR at 5 days after dsRNA injection. The relative mRNA levels are shown as a ratio in comparison with the levels of rp49 mRNA.

(C) Dose-response curves for T. castaneum adults exposed to deltamethrin. At 5 days after dsRNA injection, the following were exposed to various doses of deltamethrin: Lab-S (○), a susceptible strain; QTC279 (▽), a deltamethrin-resistant strain; QTC279-CYP6BQ9 RNAi (•), a QTC279 strain injected with CYP6BQ9 dsRNA; and QTC279-malE RNAi (▼), a QTC279 strain injected with malE dsRNA as a control.

Large-scale projects in this area include ENCODE (human) (The ENCODE Project Consortium, 2004) and modENCODE (worm and fly) (Celniker et al., 2009). The goal of these projects is the genome-wide characterization of all possible functional elements using ChIP-on-chip and other high-throughput technologies. ChIP-on-chip technology will likely contribute to a better understanding of genome organization, including functionally important elements, non-coding RNA, and chromatin markers. This may eventually lead to a comprehensive understanding of gene regulatory networks within an organism’s genome.

Many ChIP-on-chip protocols have been published, or are available online. In general, cells or tissues are treated using a reversible cross-linker (e.g., formaldehyde), so that protein and DNA are fixed in vivo. Then the protein-DNA complex within the nucleus is extracted and separated from cytoplasm. Purified protein-DNA complexes (referred to as "chromatin" hereafter) are sonicated using a conventional sonicator or Bioruptor® in order to generate DNA fragments that range from 200 to 1000 bp. The sonication conditions need to be pre-adjusted to obtain optimally sized DNA fragments. Before sonication, an aliquot of chromatin needs to be saved as a reference sample (or input sample). Usually a chromatin pre-clean step using protein-A beads is included to remove non-specific binding during the immunoprecipitation step. For the immunoprecipitation step, a certain amount (e.g., 10 μg) of antibody and protein-A beads is added to the pre-cleaned chromatin. Chromatin bound to protein-A beads is then purified, eluted, and reverse-cross-linked. Since a single ChIP DNA sample normally amounts to only a few nanograms, which is not enough for microarray hybridization, an amplification step is required. There are two ways to amplify ChIP DNA: ligation-mediated PCR (LM-PCR) and whole-genome amplification (WGA). The WGA method is considered to have lower background compared to the LM-PCR method (O’Geen et al., 2006). Amplified ChIP DNA and Input DNA are then denatured, fluorescently labeled, and hybridized to either a spotted or an oligonucleotide microarray (typically a tiling array). If there is a known target binding site for the protein of interest, the quality of ChIP samples can be assessed using real-time qPCR before submitting the samples for microarray analysis.

The data preprocessing steps of ChIP-on-chip are similar to those used in gene expression microarrays. After microarray scanning and fluorescence intensity recording, the enrichment of each binding site across the genome is obtained by comparing the intensity of each spot between ChIP DNA and Input DNA. Enriched regions can then be analyzed further, for example by identifying the genes associated with each binding region and by searching for conserved motifs. The enrichment can also be visualized using one of the many freely available genome browsers, such as UCSC Genome Browser (http://genome.ucsc.edu/), Integrated Genome Browser (IGB, http://www.bioviz.org/igb/), and Integrative Genomics Viewer (IGV, http://www.broadinstitute.org/igv/). The workflow of a chromatin immunoprecipitation experiment is shown in Figure 4.
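The spot-by-spot comparison of ChIP DNA and Input DNA can be sketched as a log2 ratio per probe. This is only an illustration: the function names, the pseudocount, the probe positions, and the twofold cutoff are assumptions for the example, not values taken from any published protocol.

```python
import math


def enrichment(chip, input_dna, pseudo=1.0):
    """Return log2(ChIP / Input) for each probe.

    `pseudo` is an assumed pseudocount that avoids division by zero
    when a background-corrected intensity is zero.
    """
    return [math.log2((c + pseudo) / (i + pseudo)) for c, i in zip(chip, input_dna)]


def enriched_probes(positions, scores, min_log2=1.0):
    """Return the genomic positions of probes at or above the cutoff."""
    return [pos for pos, s in zip(positions, scores) if s >= min_log2]


# Hypothetical probe positions and background-corrected intensities.
positions = [100, 200, 300, 400]
chip = [99.0, 399.0, 799.0, 99.0]
input_dna = [99.0, 99.0, 99.0, 199.0]

scores = enrichment(chip, input_dna)
hits = enriched_probes(positions, scores)
print(hits)
```

On a tiling array, neighboring enriched probes would normally be merged into a single binding region before gene assignment or motif searching; that step is omitted here.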

Antibody quality is a critical factor for successful ChIP-on-chip experiments. Since there are a variety of antibodies for a protein of interest, each with a specific affinity, it is always better to examine all the available antibodies in a small-scale ChIP-PCR experiment. If there are no suitable antibodies for a protein of interest, an epitope-tagged protein can be used (Zhang et al., 2008). In this way, an antibody for the epitope instead of one for the protein of interest can be used in immunoprecipitation. In Drosophila, transgenic flies may be generated to express epitope-tagged proteins in vivo.

The success of ChIP experiments also depends on the sonication step. It is suggested that 200- to 1000-bp DNA fragments should be obtained after sonication or DNA shearing. Undersonication will result in many large fragments (larger than 1000 bp) and lead to loss of resolution. Oversonication could interfere with the protein-DNA complex formation, and may result in more noise.

As mentioned above, the WGA amplification method is considered better than the LM-PCR method. Because PCR amplification introduces bias, the signal-to-noise ratio normally decreases after a PCR reaction; therefore, minimizing the number of PCR cycles is suggested. As reported by O’Geen et al. (2006), the WGA amplification method gives a higher signal-to-noise ratio and more enriched binding sites than the LM-PCR method.

Because it depends on the availability of a whole-genome sequence, ChIP-on-chip technology has mainly been used in model insects. ChIP-on-chip has been applied to dissecting the transcriptional regulatory network of embryogenesis (Sandmann et al., 2007; Zeitlinger et al., 2007; Liu et al., 2009), chromatin modification (Alekseyenko et al., 2008; Smith et al., 2009; Tie et al., 2009), epigenetic silencing (Negre et al., 2006), etc.

Figure 4 The workflow of a chromatin immunoprecipitation-sequence identification experiment. After cross-linking, the chromatin is precipitated with antibodies; the precipitated chromatin is then reverse-cross-linked, and the DNA is purified and amplified. The amplified DNA is then sequenced and aligned to the reference genome, and potential binding sites are identified.

Interestingly, a high-resolution transcriptional regulatory atlas of mesoderm development was constructed through the analysis of a key set of transcription factors, including Twist, Tinman, Myocyte enhancer factor 2, Bagpipe, and Biniou, in the Drosophila embryo (Zinzen et al., 2009).

Next Generation Sequencing (NGS)

Although DNA microarray technologies are widely used in many aspects of biological and medical research, there are some limitations. The design of the microarrays is based on our current knowledge of sequenced genomes from computationally predicted raw genome structures. These structures include gene coding regions, introns, enhancers, and non-coding RNAs. Due to a lack of comprehensive knowledge on the chromosome landscape, however, these predictions may or may not be correct. Although some tiling arrays may contain high-density oligonucleotides covering the entire genome, they are normally not cost-effective, particularly in the case of gigantic genomes (e.g., human and many plant genomes). Most importantly, in order to perform a whole-genome analysis, a sequenced genome is an absolute requirement. This becomes a limitation for many non-model organisms that do not have whole-genome sequences.

Fortunately, a breakthrough in sequencing technology has overcome this limitation and brought us into a new post-genomics era. Next generation sequencing (NGS), or deep sequencing, was first introduced in 2005 (Margulies et al., 2005; Shendure et al., 2005). When compared to automated Sanger sequencing (or first generation sequencing) (Sanger and Coulson, 1975), NGS technology has dramatically accelerated sequencing speed by increasing the number of sequencing reactions and reducing the reaction volume in one instrument run (Metzker, 2010). Because thousands of sequencing reactions are performed simultaneously, NGS is also referred to as massively parallel sequencing. Unlike Sanger sequencing, the incorporation events of fluorescently labeled nucleotides to DNA templates are almost continuously monitored and recorded. More than 100 million short reads (ranging from 35 bp to 300 bp) can be obtained using some NGS technologies. Several NGS platforms, including Roche/454 Life Sciences’ GS FLX, Illumina’s Solexa GAII, and ABI’s SOLiD, are commercially available. Each platform has its own sequencing methods and unique features (see Table 4). An overview of NGS technology and various sequencing platforms can be found in a recent review (Metzker, 2010). Here, we will focus on recent applications of NGS technologies in gene expression and ChIP studies.

RNA-Seq. RNA-sequencing (RNA-Seq) uses NGS technology for transcriptome analysis. In contrast to conventional microarray analysis, RNA-Seq provides much more information, including unpredicted novel transcripts and previously unknown alternatively spliced isoforms. As with other NGS applications, a cDNA library has to be made from RNA samples by adding adaptor sequences to one or both ends of the cDNA. Long RNA or cDNA samples then need to be fragmented. Small fragments (usually 150-300 bp) are separated by electrophoresis, isolated using the gel extraction method, and then purified for sequencing. After sequencing, which may take from a single day to a week depending on the platform used, the sequence reads are aligned to a reference genome, or used for de novo assembly if no genome information is available.

Due to the tremendous amount of sequencing data obtained after each sequencing run, there are always challenges in data handling and statistical analysis. Several bioinformatics programs, such as ELAND (by Illumina), SOAP (Li et al., 2008a), and BOWTIE (Langmead et al., 2009), have been developed for mapping the reads to a reference genome. Typically, reads with a single match to the genome sequence will be selected for further analysis. Reads with more than three mismatches, or reads that match multiple regions of the genome, will be discarded. The mismatches may be due to sequencing errors, polymorphisms, poor sequencing quality, or low expression abundance. The reads can be found within exon regions, exon junctions, and the regions near poly(A) tails. The expression level for each gene can then be determined by the enrichment of reads across entire ORFs (open reading frames). Like other NGS technologies, RNA-Seq has many advantages over expression microarray analysis. RNA-Seq has very low background, and is cost-effective. It also has better sensitivity to detect genes with very low or high expression levels. Most importantly, RNA-Seq is useful for detecting novel and rare transcripts and alternatively spliced transcripts. It also offers great opportunities for the de novo transcriptome analysis of non-model organisms.
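The read selection rule described above (keep reads with a single genomic match and no more than three mismatches, then tally reads per gene) can be sketched as follows. The `Alignment` record and its field names are hypothetical and do not correspond to the output format of ELAND, SOAP, or BOWTIE.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one mapped read."""
    read_id: str
    gene: str         # gene whose ORF the read maps to
    mismatches: int   # mismatches against the reference
    n_hits: int       # number of genomic locations the read matches


def keep(aln, max_mismatches=3):
    """Keep uniquely mapped reads with at most `max_mismatches` mismatches."""
    return aln.n_hits == 1 and aln.mismatches <= max_mismatches


def count_reads(alignments):
    """Count the retained reads per gene."""
    return Counter(a.gene for a in alignments if keep(a))


# Hypothetical alignments for two genes.
alignments = [
    Alignment("r1", "geneA", 0, 1),
    Alignment("r2", "geneA", 2, 1),
    Alignment("r3", "geneA", 4, 1),  # discarded: more than three mismatches
    Alignment("r4", "geneB", 1, 3),  # discarded: matches multiple regions
    Alignment("r5", "geneB", 3, 1),
]

counts = count_reads(alignments)
print(dict(counts))
```

Raw counts depend on ORF length and on sequencing depth, so in practice they would be normalized before genes or samples are compared; that step is left out of this sketch.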

RNA-Seq technology has been used in a transcriptome analysis of Aedes aegypti in response to pollutants and insecticides (David et al., 2010). A Drosophila melanogaster 5′-end mRNA transcription database was constructed through RNA-Seq technology, and contains expression profiles of each fly gene at various developmental stages (http://machibase.gLk.u-tokyo.ac.jp/ [Ahsan et al., 2009]). Roche/454-based pyrosequencing has been widely used to sequence the transcriptomes of non-model insects, such as the Glanville fritillary butterfly (Vera et al., 2008).

Table 4 List of Next-Generation Platforms

Platform	Manufacturer	Sequencing method	Features
GS FLX	Roche/454 Life Sciences	Pyrosequencing	Long reads (300-400 bp); fast run time.
Solexa GAII	Illumina	Reversible termination	Short reads (35 or 70 bp); huge reads per run (~20 GB)
SOLiD	ABI	Sequence by ligation	Short reads (~50 bp); huge reads per run (similar to Solexa)
HeliScope	Helicos BioSciences	Reversible termination; single molecule sequencing	No bias introduced from library construction

ChIP-Seq. Chromatin immunoprecipitation sequencing (ChIP-Seq) is the sequencing-based genome-wide mapping of protein-DNA interactions. Similar to the ChIP-on-chip technology mentioned earlier, ChIP-Seq also involves the pull-down of DNA fragments (ChIP DNA) bound by a protein of interest. Instead of hybridizing ChIP DNA to an oligonucleotide microarray, a sequencing library is constructed by adding adaptor sequences to ChIP DNAs, followed by size selection and gel purification. After submitting the library to sequencing, ChIP-Seq raw data are generated, which may contain more than 100 million short reads. These reads are then aligned to a reference genome, and high-quality reads that have a good match to a single genomic region (one to two nucleotide mismatches are allowed) are selected. Normally, 60-80% of the total reads can be aligned to a reference genome. The enriched regions (binding sites) can be obtained by comparing the reads between ChIP DNA and control DNA (e.g., Input or mock DNA samples) in a process called peak calling. Various bioinformatics tools are available for performing peak calling, including PeakSeq (Rozowsky et al., 2009), QuEST (Valouev et al., 2008), CisGenome (Jiang et al., 2010), and Galaxy (Giardine et al., 2005). Finally, the enriched regions (or peaks) can be visualized using genome browsers, as mentioned previously.
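The idea behind peak calling, comparing read density between ChIP DNA and control DNA, can be illustrated with a deliberately simple fixed-window sketch. It is not the algorithm used by PeakSeq, QuEST, or CisGenome; the window size, pseudocount, and thresholds are arbitrary assumptions, and real tools rely on statistical models rather than a plain fold cutoff.

```python
from collections import Counter


def window_counts(read_starts, window=200):
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(pos // window for pos in read_starts)


def call_peaks(chip_reads, control_reads, window=200,
               min_fold=4.0, min_reads=5, pseudo=1.0):
    """Return (start, end, fold) for windows enriched in ChIP over control.

    Control counts are scaled to the ChIP sequencing depth, and an assumed
    pseudocount keeps the expected count above zero.
    """
    chip = window_counts(chip_reads, window)
    ctrl = window_counts(control_reads, window)
    scale = len(chip_reads) / max(len(control_reads), 1)
    peaks = []
    for w in sorted(chip):
        observed = chip[w]
        expected = ctrl.get(w, 0) * scale + pseudo
        fold = observed / expected
        if observed >= min_reads and fold >= min_fold:
            peaks.append((w * window, (w + 1) * window, fold))
    return peaks


# Hypothetical read start positions on one chromosome.
chip_reads = [1000 + 10 * i for i in range(10)] + [50, 120, 2050, 2100]
control_reads = [30, 90, 150, 2010, 2080, 2150, 3000]

peaks = call_peaks(chip_reads, control_reads)
print(peaks)
```

In this toy data set only the window from 1000 to 1200 bp holds enough ChIP reads and is free of control reads, so it is the single region reported.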

ChIP-Seq technology offers many advantages over ChIP-on-chip. ChIP-Seq data approach single-nucleotide resolution, which is much higher than the resolution of ChIP-on-chip, so binding motif analysis is simplified. ChIP-Seq technology also provides more information on protein-DNA interactions, and better genome coverage. Since there is no hybridization step involved, ChIP-Seq normally has less background noise, and can detect a wide dynamic range of binding events. In contrast, ChIP-on-chip technology has difficulty in distinguishing very low or very high binding events. With technological advancements, ChIP-Seq technology will become less costly for analyzing most genomes. ChIP-Seq has been used in characterizing MSL-complex regulatory networks in the X-chromosome of D. melanogaster (Alekseyenko et al., 2008), as well as in a genome-wide methylome study of the silkworm, Bombyx mori (Xiang et al., 2010). Once the cost of ChIP-Seq declines to a level comparable to that of ChIP-on-chip, there will be more ChIP-Seq applications in insect research.

Other Methods

In addition to mRNA, there are many non-coding RNAs (ncRNAs) within a genome. These include highly abundant and functionally relevant RNAs such as transfer RNA, ribosomal RNA, microRNAs, and long intergenic non-coding RNAs. Combining functional analysis with high-throughput microarray or sequencing technologies has allowed the identification and characterization of novel ncRNAs. Many ncRNAs, particularly microRNAs, have been found to be involved in development (Zhang et al., 2009), neurodegeneration (Karres et al., 2007), cell proliferation (Thompson and Cohen, 2006), circadian rhythms (Yang et al., 2008), and host-parasitoid interactions (Gundersen-Rindal and Pedroni, 2010).

High-throughput microarray or sequencing technologies have also been applied to studies on metagenomics, or the study of genetic material recovered from environmental samples (e.g., microflora of the ocean, soil or insect gut). With the help of Roche/454 pyrosequencing technology, the Israeli acute paralysis virus was recently identified, and found to be associated with colony collapse disorder (CCD) in honey bees (Cox-Foster et al., 2007). A large set of bacterial genes with cellulose and xylan hydrolysis functions was identified using pyrosequencing from the hindgut of a wood-feeding higher termite that is closely related to Nasutitermes ephratae (Warnecke et al., 2007).
