SAGE (Genomics)

1. Introduction

Serial analysis of gene expression (SAGE) shares much in common with other digital gene expression technologies such as expressed sequence tag (EST) sequencing (see Article 78, What is an EST?, Volume 4) and massively parallel signature sequencing (MPSS). These techniques all aim to comprehensively profile gene expression by obtaining short stretches of DNA sequence from randomly selected cDNAs in a sample of interest. As when conducting an opinion poll, the more cDNAs that are sampled, the more accurate and comprehensive the resulting expression profile. SAGE distinguishes itself from other technologies by the cloning and sequencing of multiple concatenated sequence tags – short DNA sequences obtained from a defined point within each cDNA that are long enough to uniquely identify each transcript (Velculescu et al., 1995; Saha et al., 2002).

2. How is SAGE performed?

mRNA molecules from a sample of interest are bound at their 3′ ends using magnetic beads conjugated to oligo(dT), and first- and second-strand cDNA is synthesized. The sample is then digested to completion with a four-base cutter restriction enzyme (usually NlaIII by convention, although other four-base cutters can be used) (see Figure 1, Step 1). All sequences that lie 5′ of the 3′-most NlaIII site are thus lost, and this NlaIII site defines the position of the sequence tag.

A linker containing both a PCR primer and a type IIS restriction site is then ligated to the 5′ end of the cDNA (Figure 1, Step 2). The sample is typically divided into two at this point, with linkers that differ only in the PCR primer sequence being used. The sample is then digested with the type IIS enzyme whose site is encoded in the linker. Depending on the enzyme used, cleavage occurs within the cDNA anywhere from 10 to 22 bp 3′ of the NlaIII site, generating the SAGE tag for that particular transcript. The cleaved tags are typically blunt-ended. Tags ligated to the two sets of linkers are then combined, ligated together to form ditags, and amplified by PCR (Figure 1, Step 3).
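The position of the tag can be illustrated in silico. The following is a minimal sketch, not a published tool: it assumes the cDNA is given 5′ to 3′ as a plain string, that the anchoring enzyme is NlaIII (recognition sequence CATG), and that the tag is the 10 bp immediately 3′ of the 3′-most site. The function name and the default tag length are hypothetical choices made for illustration.

```python
def extract_sage_tag(cdna, anchor_site="CATG", tag_length=10):
    """Return the tag immediately 3' of the 3'-most anchoring site, or None.

    Assumptions (illustrative only): `cdna` is written 5'->3', the
    anchoring enzyme is NlaIII (CATG), and the tag excludes the site itself.
    """
    seq = cdna.upper()
    pos = seq.rfind(anchor_site)  # 3'-most occurrence of the site
    if pos == -1:
        return None  # no anchoring site, so this transcript yields no tag
    start = pos + len(anchor_site)
    tag = seq[start:start + tag_length]
    if len(tag) < tag_length:
        return None  # site lies too close to the 3' end for a full-length tag
    return tag


example_cdna = "GGCATGAAACCCGGGTTTCATGACGTACGTAGCTAGAAAA"
print(extract_sage_tag(example_cdna))  # tag 3' of the second (3'-most) CATG
```

Only the 3′-most site matters here: the upstream CATG in the example sequence is ignored, mirroring the loss of all sequence 5′ of that site after digestion.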


Ditags are then recleaved with NlaIII, gel purified, ligated to form concatemers, and subcloned into plasmid vectors to form a SAGE library. Individual clones are then isolated and sequenced (Figure 1, Step 4). Finally, SAGE tags are matched to genes using appropriate software, and gene expression levels are determined simply by tag count. Tag abundance levels from the library of interest are then compared to those from other libraries to identify differentially expressed genes (Figure 1, Step 5). Duplicate ditags are discarded from the analysis, thus controlling for any tag-specific biases in PCR amplification.
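The tag-counting step, including the removal of duplicate ditags, might be sketched as follows. This is an illustration under stated assumptions rather than the behavior of any particular SAGE software: each clone read is assumed to be a concatemer in which ditags are separated by CATG, tags are assumed to be 10 bp, and the accepted ditag length window (20 to 24 bp) is a hypothetical filter. All function names are invented for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_ditags(read, anchor_site="CATG", min_len=20, max_len=24):
    """Split a concatemer read at anchoring sites and keep plausible ditags."""
    pieces = read.upper().split(anchor_site)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def count_tags(reads, tag_length=10):
    """Count tags across reads, using each distinct ditag only once."""
    seen = set()
    counts = Counter()
    for read in reads:
        for ditag in split_ditags(read):
            # The same ditag may be sequenced in either orientation,
            # so use an orientation-independent key for duplicate detection.
            key = min(ditag, reverse_complement(ditag))
            if key in seen:
                continue  # duplicate ditag: likely a PCR artifact, skip it
            seen.add(key)
            counts[ditag[:tag_length]] += 1                      # left tag
            counts[reverse_complement(ditag)[:tag_length]] += 1  # right tag
    return counts
```

The right-hand tag is read from the reverse complement because the two tags in a ditag are joined tail to tail, so the second tag appears on the opposite strand.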


Figure 1 SAGE library construction and analysis

3. Digital- versus hybridization-based expression profiling

Digital-based expression profiling technologies such as SAGE have a number of advantages over hybridization-based approaches such as microarray analysis (see Article 90, Microarrays: an overview, Volume 4). Digital-based approaches are nearly comprehensive, sampling all expressed genes rather than being limited by one’s choice of probes. Absolute levels of gene expression (i.e., individual tags as a percentage of total tags) are obtained, which makes it straightforward to compare data sets obtained by different labs, and large amounts of public digital expression data are already available from sites like the NCBI SAGE map project (http://www.ncbi.nlm.nih.gov/SAGE). It should be noted, however, that tag abundance measurements are more accurate for highly expressed transcripts than for those expressed at low levels. Only the widely used molecular approaches of cDNA library construction and DNA sequencing are used to generate the data, making it easy to initiate these studies. Very little experimental error is introduced in the construction of the libraries (Blackshaw et al., 2003), with PCR amplification biases being controlled by the elimination of duplicate ditags from analysis. The sensitivity of digital-based expression profiling is limited only by one’s ability to sequence: one could, in principle, sequence enough tags to cover every mRNA in the sample of interest. Finally, since there is no hybridization background or variation in probe quality to consider, interpretation of digital gene expression data is quite straightforward and easily performed with a simple set of statistical tools (see Article 53, Statistical methods for gene expression analysis, Volume 7).
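Because expression is reported as a fraction of total tags, libraries of different sizes can be placed on a common scale. A minimal sketch, assuming tag counts are held in a dictionary and choosing tags per million as the common scale (the function name and the scale are illustrative choices, not taken from the original article):

```python
def tags_per_million(counts):
    """Convert raw tag counts to tags per million of total tags."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    return {tag: 1_000_000 * n / total for tag, n in counts.items()}


# Two hypothetical libraries of different depth: the raw counts differ,
# but the normalized abundance of tag "X" is the same in both.
small_library = {"X": 1, "Y": 3}
large_library = {"X": 100, "Y": 300}
print(tags_per_million(small_library)["X"])
print(tags_per_million(large_library)["X"])
```

Normalization of this kind makes abundances comparable, but it does not remove sampling noise: a count of 1 in a small library is far less certain than a count of 100 in a large one.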

Hybridization-based systems have a number of advantages of their own, however, cost and speed being the main ones. Commercially available oligonucleotide arrays (see Article 92, Using oligonucleotide arrays, Volume 4) giving nearly complete coverage of the human genome can be hybridized in triplicate for ~$3000, while constructing and sequencing 50 000 tags from a SAGE library can cost ~$15 000. Furthermore, construction of SAGE libraries can require two weeks of work, and performing the required number of sequencing reactions can take months, while microarray hybridization can be completed in two days and many arrays can be run in parallel. The considerable variation in gene expression seen among biological samples (Blackshaw et al., 2003; Pritchard et al., 2001) is best controlled for by obtaining multiple replicates of expression profiles – a proposition that is difficult to accomplish with digital-based methods. Since all digital-based methods rely on sampling, the “real” tag count in a sample actually represents a Poisson distribution around the observed tag count in a library (Audic and Claverie, 1997), introducing a level of uncertainty that is unacceptable for certain studies. Though expression profiling of very small amounts of starting material (i.e., <1 µg total RNA) typically requires amplification of the starting material and leads to skewed representation of certain transcripts, the comparative hybridization of amplified samples can control for this, making it the method of choice in such situations. Finally, variations in efficiency of cDNA-target hybridization among probes can result in expression of some rare genes being detected quite efficiently, whereas rare genes are only efficiently detected at high tag numbers by digital-based methods.
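The size of this sampling uncertainty can be illustrated with a short calculation. The sketch below is not the test described by Audic and Claverie (1997); it simply assumes that counts for a tag with a given expected value follow a Poisson distribution and reports the relative standard deviation and an equal-tailed range of counts. The function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k tags when lam (> 0) are expected."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def relative_sd(lam):
    """Standard deviation of a Poisson count divided by its mean."""
    return 1.0 / math.sqrt(lam)


def poisson_interval(lam, coverage=0.95):
    """Equal-tailed range of counts containing at least `coverage` probability."""
    tail = (1.0 - coverage) / 2.0
    cumulative = 0.0
    low = None
    k = 0
    while True:
        cumulative += poisson_pmf(k, lam)
        if low is None and cumulative >= tail:
            low = k  # first count at which the lower tail is exhausted
        if cumulative >= 1.0 - tail:
            return low, k
        k += 1


for expected in (5, 50, 500):
    print(expected, round(relative_sd(expected), 3), poisson_interval(expected))
```

The relative uncertainty shrinks only as the square root of the count, which is why rare transcripts require disproportionately deep sequencing to quantify.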

4. Hybridization or digital-based expression profiling – which to use?

The choice of a digital or hybridization-based approach to profiling gene expression depends to a large extent on one’s experimental aims. If the primary aim of a study is gene discovery, or more generally, to analyze gene expression in a limited number of samples in great depth and at high resolution, digital-based methods such as SAGE are probably the method of choice. Digital-based methods are particularly well suited to generating publicly accessible databases of gene expression. Hybridization-based methods, on the other hand, are the method of choice when a large number of samples need to be screened or when very small amounts of starting material are used. More generally, hybridization-based approaches are probably the best choice for exploratory studies where a large initial investment of resources is not desired.

5. Comparison of SAGE with other methods of digital-based expression profiling

Since 20–40 SAGE tags are obtained with each sequencing reaction, as opposed to 1 tag per reaction with conventional EST sequencing, SAGE is much more efficient at profiling gene expression than EST sequencing. Moreover, the great majority of public EST data is obtained from libraries that have been both normalized and subtracted, and thus does not accurately reflect mRNA levels in the sample in question. MPSS has many of the same advantages as SAGE and is, in principle, more rapid, although costs per sample are high at present and the technology is not widely available. Neither SAGE nor MPSS, however, gives extensive information about alternative splicing of mRNAs, and both approaches require a fully sequenced genome to be maximally useful. Conventional EST sequencing may be the method of choice in cases in which either of these factors is a concern.

6. Variations and choice of methodology in SAGE

A number of variations of SAGE have been developed in recent years, including the generation of 5′-anchored libraries using affinity to the 5′ cap structure or, in Caenorhabditis elegans, to the trans-spliced 5′ leader exon found in virtually all mRNAs (Hwang et al., 2004; Wei et al., 2004). The choice of SAGE tag length varies with the complexity of the genome one is analyzing and the specific topic addressed. All things being equal, a shorter tag is preferable, since more tags can then be obtained in each sequencing reaction. Since each tag specifies 4^n possible combinations, where n = tag length, a 10-bp tag can specify over 10^6 combinations (4^10 = 1 048 576) – more than enough to unambiguously define the majority of transcripts in most organisms, provided cDNA sequences for the transcripts are available. However, to identify a unique match in genomic DNA and allow de novo transcript annotation, anywhere from 10^10 to 10^11 combinations must be specified, so tags longer than 13 bp are required (13 unique bp + 4 bp from the NlaIII site = 17 bp, giving 4^17, or about 1.7 × 10^10, combinations).
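The tag-length arithmetic above is easy to check. A small sketch (function names are hypothetical) that computes the number of sequences a tag of a given length can specify, and the shortest length needed to reach a required number of combinations:

```python
def tag_combinations(length):
    """Number of distinct DNA sequences of the given length (4 bases)."""
    return 4 ** length


def min_tag_length(required_combinations):
    """Smallest length n such that 4**n >= required_combinations."""
    n = 0
    while tag_combinations(n) < required_combinations:
        n += 1
    return n


print(tag_combinations(10))     # 10-bp tag
print(tag_combinations(17))     # 13 unique bp plus the 4-bp anchoring site
print(min_tag_length(10 ** 10))
print(min_tag_length(10 ** 11))
```

Note that these figures count possible sequences only. They say nothing about how evenly real transcripts or genomic sites are distributed among those sequences, so the number of tags that map uniquely in practice will be lower.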

The number of tags sequenced will depend on the number of transcripts per cell in the sample tested, the sensitivity desired, and the cellular complexity of the sample examined. Abundant transcripts will be accurately profiled with relatively low tag counts, but rare mRNAs in abundant cell types or abundant mRNAs selectively expressed in rare cell types will require much higher tag counts to profile accurately.
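A rough feel for the required sequencing depth can be obtained from a simple sampling model. The sketch below assumes that each sequenced tag is an independent draw and that a transcript of interest makes up a fixed fraction of all tags in the sample; the function names and the example figures (300 000 transcripts per cell, a cell type making up 1% of the tissue) are hypothetical values chosen only for illustration.

```python
import math


def detection_probability(n_tags, fraction):
    """Probability of seeing at least one tag from a transcript that makes
    up `fraction` of all tags, after sequencing n_tags tags."""
    return 1.0 - (1.0 - fraction) ** n_tags


def tags_needed(fraction, confidence=0.95):
    """Smallest number of tags giving at least `confidence` probability
    of sampling the transcript once or more."""
    if not 0.0 < fraction < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("fraction and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-fraction))


# Hypothetical example: 5 copies per cell out of an assumed 300 000
# transcripts per cell, expressed only in a cell type that is 1% of the sample.
copies_per_cell = 5
transcripts_per_cell = 300_000
cell_type_fraction = 0.01
fraction = copies_per_cell / transcripts_per_cell * cell_type_fraction
print(tags_needed(fraction))
```

Detecting a transcript once is a much weaker requirement than measuring its level accurately, so the depth returned by this model should be read as a lower bound for profiling purposes.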
