Expressed Sequence Tag (Molecular Biology)

An expressed sequence tag (EST) is a short stretch of DNA sequence that is used to identify an expressed gene. Although EST sequences are usually only 200 to 500 nucleotides in length, this is generally sufficient to identify the full-length complementary DNA (cDNA). ESTs are generated by sequencing a single segment of random clones from a cDNA library. Because only a single sequencing reaction is needed per clone, and because DNA isolation, sequencing, and analysis have been automated, many ESTs can be determined rapidly. ESTs now make up the majority of the sequences in sequence databases. Although most ESTs have been isolated from humans, a large number have been isolated from model organisms, such as Caenorhabditis elegans (1), Drosophila, rice (2), and Arabidopsis (3). ESTs are also being isolated from more exotic organisms, such as Entamoeba histolytica (4) and Leishmania major promastigotes (5). ESTs have numerous uses, from genetic mapping to analyzing gene expression, and the number of ESTs isolated from different organisms will continue to rise rapidly.

1. Generation of ESTs

The most important step in generating ESTs is producing a cDNA library. First, messenger RNA (mRNA) is extracted from the material being studied and is used as a template by reverse transcriptase for cDNA synthesis. The cDNA is then cloned into a suitable vector to produce a cDNA library. Random clones are isolated from the library, and one or both ends are sequenced by single-pass sequencing.


The source of the mRNA is an organism, tissue, or cell line grown under normal conditions or treated with hormones, drugs, heat, etc. The choice of starting material depends on the type of genes to be identified and the reason for generating the ESTs. EST analysis is used to find previously undiscovered tissue-specific genes, to tag as many different expressed genes as possible, or to examine the relative abundance of expressed genes. Libraries used for the first two types of studies are often normalized, so that highly expressed genes and rare genes are represented more equally in the library, or subtracted, to reduce the number of highly abundant clones or eliminate them altogether (see Subtractive Hybridization). An alternative to normalizing or subtracting libraries is to sequence a small number of ESTs from as many different tissue types and treatments as possible, because different types of genes are highly expressed under different circumstances.

It is useful if the cDNAs in the library are cloned directionally, so that ESTs are isolated specifically from either the 5′- or the 3′-end of the cDNA. 3′-ESTs often represent the 3′-untranslated region of the mRNA and are used to separate members of gene families that have similar coding sequences, whereas 5′-ESTs generally represent coding sequence and give a better idea of the type of gene being expressed.
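The difference between the two read types can be illustrated with a minimal Python sketch that simulates single-pass reads from each end of a directionally cloned cDNA. The function names, the 300-nucleotide default read length, and the convention that the 3′ read comes off the opposite strand (and so is the reverse complement of the mRNA-sense sequence) are assumptions made for illustration only.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def end_reads(cdna: str, read_length: int = 300):
    """Simulate one read from each end of a directionally cloned cDNA.

    cdna is written 5'->3' in the mRNA sense. The 5' read is simply the
    first read_length bases. The 3' read is assumed to be sequenced from
    the opposite strand, so it is the reverse complement of the last
    read_length bases.
    """
    five_prime = cdna[:read_length]
    three_prime = reverse_complement(cdna[-read_length:])
    return five_prime, three_prime


# Toy cDNA, far shorter than a real one, with a 5-base read length.
five, three = end_reads("ATGGCCAAATTTGGG", read_length=5)
print(five, three)
```

Because the cloning direction is known, each read can be labeled as a 5′-EST or a 3′-EST without comparing it to anything else.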

If the starting material for mRNA isolation is limiting, cDNA minilibraries suitable for EST analysis can be generated by arbitrarily primed RT-PCR (6).

2. The Multitudinous Uses of ESTs

2.1. Obtaining Full-Length cDNAs from ESTs

An EST is useful because it represents an expressed gene. Once an EST has been identified, it is usually straightforward to obtain the full sequence of the "tagged" cDNA. Some sources provide EST DNA that can be used in further studies. Alternatively, the EST sequence is used to design PCR primers or hybridization probes for cloning the full-length cDNA (see Cloning). Another strategy is "virtual cloning", in which computer analysis is used to align homologous ESTs and any other sequences from the same gene (7). Once isolated, the full-length cDNA is used in mutagenesis, transgenic, and expression studies to analyze the gene function.
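The idea behind virtual cloning can be sketched in Python as a greedy extension of a seed EST by other ESTs that overlap its 3′ end. This is a toy illustration under strong assumptions: overlaps must match exactly, all reads are on the same strand, and the function names and the minimum-overlap parameter are hypothetical. Real EST assembly has to tolerate sequencing errors and reads from both strands.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def extend_contig(seed: str, ests, min_overlap: int = 20) -> str:
    """Greedily extend seed rightwards with ESTs that overlap its 3' end."""
    contig = seed
    remaining = list(ests)
    extended = True
    while extended:
        extended = False
        for est in remaining:
            size = overlap_length(contig, est, min_overlap)
            # Only accept an EST that overlaps and adds new sequence.
            if size > 0 and len(est) > size:
                contig += est[size:]
                remaining.remove(est)
                extended = True
                break
    return contig


# Toy reads; real ESTs are hundreds of nucleotides long.
reads = ["TTAGCCCGGA", "GTACGTTAGC"]
print(extend_contig("ATGGCGTACG", reads, min_overlap=5))
# -> ATGGCGTACGTTAGCCCGGA
```

The resulting contig is only a prediction of the transcript sequence, so it would still need to be confirmed against an actual clone.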

2.2. Genome Analysis and Mapping with ESTs

Many organisms, for example, humans, Arabidopsis, and C. elegans, will have their entire genomic DNA sequenced in the relatively near future, and certain genomes, such as those of Saccharomyces cerevisiae and various bacteria, have already been entirely sequenced. In the meantime, EST analysis provides a rapid way to identify expressed genes. At present, it is not possible to predict coding sequences in genomic DNA reliably from sequence information alone. Even when complete genomic sequences are available, ESTs are useful in analyzing the genomic sequence, for example by verifying putative coding sequences and confirming intron and exon boundaries. They are also helpful in distinguishing pseudogenes from real genes and in identifying alternatively spliced transcripts, which could never be predicted from the genomic sequence alone.

ESTs are also used to provide markers for genomic mapping by converting them into "sequence tagged sites" (STSs). An STS is a short, unique stretch of genomic sequence defined by the pair of PCR primers that amplifies it; when the primers are designed from an EST sequence (8), the STS identifies a single gene. An STS can be mapped to a specific region of the genome, and if DNA contigs are available for the genome under study, the STS can specify a genomic clone. Mapping of ESTs to genomic clones will provide a map of expressed genes for each clone and, in humans, these maps can be used to identify candidate genes for inherited human diseases that have been mapped to chromosomal regions by pedigree analysis.
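How a primer pair can single out a genomic clone is sketched below as a simple in silico PCR check in Python. The sketch assumes exact primer matches, searches only one strand orientation of each clone, and uses made-up clone names and sequences; it is not a description of any particular mapping software.

```python
from typing import Optional


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def amplicon_size(template: str, forward: str, reverse: str) -> Optional[int]:
    """Size of the product both primers would amplify, or None if no product.

    The forward primer must match the template strand as written; the
    reverse primer must match the opposite strand downstream of it.
    """
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return end + len(reverse_site) - start


def clones_with_sts(clones: dict, forward: str, reverse: str) -> list:
    """Names of the clones in which the STS primer pair yields a product."""
    return [name for name, seq in clones.items()
            if amplicon_size(seq, forward, reverse) is not None]


# Hypothetical clones and primers, shortened for readability.
clones = {
    "cloneA": "TTTTATGGCCAAAAAAAAGGTACATTTT",
    "cloneB": "TTTTTTTTCCCCCCCCTTTTTTTT",
}
print(clones_with_sts(clones, forward="ATGGCC", reverse="TGTACC"))
# -> ['cloneA']
```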

2.3. Gene Expression Studies Using ESTs

As mentioned previously, gene expression in different organisms, tissues, or cells, before or after different treatments, can be studied by surveying ESTs isolated from libraries created from different mRNA sources. If nonnormalized and unsubtracted libraries are used to isolate ESTs, the frequency of isolation of a particular EST indicates the relative message abundance of the tagged gene (9). It is also possible to isolate novel tissue-specific genes by merely searching EST databases (10).
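The relationship between EST frequency and message abundance amounts to simple counting, as in the following Python sketch. It assumes that each EST has already been assigned to a gene and that the library was neither normalized nor subtracted; the gene names and counts are invented for illustration.

```python
from collections import Counter


def relative_abundance(est_gene_ids):
    """Fraction of ESTs in a library that tag each gene."""
    counts = Counter(est_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical library: one gene label per sequenced EST.
library = ["actin", "actin", "actin", "myosin"]
print(relative_abundance(library))
# -> {'actin': 0.75, 'myosin': 0.25}
```

With only a few hundred ESTs per library, rare messages may be seen once or not at all, so such estimates are most informative for abundant transcripts.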

2.4. Rapid Identification of New Genes and Gene Families

As of 1998, the majority of ESTs in the databases represent genes that have not been previously identified. Traditionally, gene sequences have been obtained one at a time: by first isolating a protein, by identifying a mutant and cloning the mutated gene by complementation, or by positional cloning (see Cloning). Compared with these methods, generating ESTs is a very rapid method for identifying new gene sequences.

EST database searches also provide a rapid method for finding genes similar to a gene of interest. Methods such as PCR with degenerate primers or hybridization at low stringency also enable the identification of similar genes, but these methods are often not as successful as database searches. Databases can be searched by using either nucleic acid sequences or protein sequences. Protein comparison searches are often more fruitful because protein sequences are better conserved. Additionally, databases can be used to search for similar genes among all the represented organisms simultaneously, thus facilitating comparative studies. Once similar genes have been identified, they can be translated into protein sequences and the database searched again to find the sequences most similar to this second set of genes. Thus it is possible to identify large families of related genes rapidly by EST database analysis.
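The repeated search described above can be sketched in Python as follows. The similarity measure here is a deliberately crude stand-in (the fraction of identical residues at matching positions, with no alignment), and the sequences, names, and threshold are hypothetical; an actual search would rely on a proper alignment-based database search tool.

```python
def identity(a: str, b: str) -> float:
    """Fraction of positions with identical residues (no gaps, toy measure)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def find_family(query: str, database: dict, threshold: float = 0.75) -> set:
    """Search with the query, then search again with every new hit."""
    family = {}
    frontier = [query]
    while frontier:
        probe = frontier.pop()
        for name, seq in database.items():
            if name not in family and identity(probe, seq) >= threshold:
                family[name] = seq
                frontier.append(seq)  # each hit becomes a new query
    return set(family)


# Hypothetical translated ESTs.
database = {
    "est1": "MKTAYIAKQL",  # close to the query
    "est2": "MKTGYIGKQL",  # close to est1, more distant from the query
    "est3": "GGGGGGGGGG",  # unrelated
}
print(sorted(find_family("MKTAYIAKQR", database)))
# -> ['est1', 'est2']
```

In this toy case est2 is missed by the first search and is recovered only when est1 is used as a query, which is the point of searching again with the second set of sequences.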

2.5. Using ESTs to Clone Specific Genes

In humans, one main goal of genomic research and EST analysis is to aid in identifying and cloning human genes related to diseases. Various groups have begun to look between species to find candidate human disease genes. One group used an EST database search to identify human cDNAs homologous to previously cloned Drosophila genes that have an interesting developmental mutant phenotype (11). Then the human sequences were mapped by several methods, and the positions of the cDNAs compared with the map positions of human disease genes. Some of the cDNAs mapped to regions containing human disease genes that cause symptoms similar to the defects in the corresponding Drosophila mutant genes. Thus these cDNAs are candidates for the disease genes, and further studies can be undertaken to determine whether the cDNAs and the disease genes are one and the same. Another group is systematically identifying novel human ESTs related to known genes in a variety of model organisms (12). Then the ESTs are mapped to both human and mouse maps. Again, such genes can provide candidates for human disease genes, thereby potentially speeding up their cloning and aiding in their analysis by providing a model system for further studies.

ESTs are extremely useful tools for gene analysis in organisms besides humans, although to date most effort has been spent identifying human ESTs. In any organism where a mutation has been mapped to a certain genomic region or to a genomic clone, ESTs mapping to the same region identify candidate genes. In more unusual organisms, where few genes have been previously isolated, EST analysis provides a rapidly generated survey of the types of genes expressed by the organism and is used, for example, to isolate novel genes from pathogenic organisms (4, 5). In plants, the weed Arabidopsis is often used as a model system for gene isolation. EST database searches provide a rapid means of isolating similar genes from important crop plants, in much the same way that interesting genes from model systems are used to isolate corresponding genes from humans and mice. STSs from ESTs representing important plant genes could be used as markers in selective breeding programs.

3. Some Recent Twists on EST Generation

A technique termed serial analysis of gene expression (SAGE) has recently been developed to perform some of the functions of EST analysis (13). Short diagnostic sequence tags (9 to 10 bp) are randomly isolated from a tissue, concatenated, cloned, and sequenced. This method allows extremely rapid identification of thousands of genes and is used either to identify new genes or to analyze relative levels of gene expression. SAGE may prove quite useful in rapid surveys of gene expression differences between tissues or between developmental and disease states. ESTs still have many advantages over this method, however. The production and concatenation of the tags are somewhat cumbersome, and the small amount of sequence information in the tags precludes many of the interesting uses that have been found for ESTs.
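Both the speed and the limited information content of short tags can be illustrated with a small Python sketch. It computes how many distinct tags of a given length exist and counts the tags in a concatenated read. The assumption that tags lie end to end with no intervening sequence is a simplification made for this example and is not a description of the actual SAGE protocol; the function names are hypothetical.

```python
from collections import Counter


def tag_space(length: int) -> int:
    """Number of distinct DNA tags of a given length (4 bases per position)."""
    return 4 ** length


def count_tags(concatemer: str, tag_length: int = 10) -> Counter:
    """Split a concatenated read into consecutive tags and count each one.

    Assumes tags are joined directly, with no separators; any trailing
    bases that do not fill a whole tag are ignored.
    """
    last_start = len(concatemer) - tag_length
    tags = [concatemer[i:i + tag_length]
            for i in range(0, last_start + 1, tag_length)]
    return Counter(tags)


print(tag_space(9), tag_space(10))
# -> 262144 1048576

read = "AAAAACCCCC" + "GGGGGTTTTT" + "AAAAACCCCC"
print(count_tags(read))
```

A single sequencing read therefore reports on many transcripts at once, but each tag carries far less sequence than a 200 to 500 nucleotide EST.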

An ingenious method has been developed to tag promoter-proximal sequences in mouse embryonic stem (ES) cells using a gene-trap retrovirus shuttle vector (14). A large number of ES cells were selected for neomycin resistance, indicating that the retrovirus containing a promoterless neomycin-resistance gene had inserted next to the promoter of an expressed gene. These neomycin-resistant ES cells were cloned. Some cloned cells were frozen, and others were used to extract DNA. The DNA flanking the retrovirus was isolated and sequenced to provide a promoter-proximal sequence tag (PST). PSTs are analogous to ESTs but are derived from genomic DNA rather than cDNA, and they are used similarly to screen sequence databases and to make STSs for mapping. However, they also have an additional advantage. Each PST represents a specific ES cell line harboring a potentially disrupted gene, and mutant ES cells are used to generate mutant mouse strains (15). PSTs thus allow rapid progression from sequence analysis of a gene to analysis of its function in a mutant mouse.

4. The Future of ESTs

ESTs have already proved useful in rapidly identifying novel genes, providing markers for mapping, providing candidate disease genes, and analyzing gene expression. Because the generation of ESTs proceeds faster than the functional identification and analysis of the genes that they represent, much future research will be directed toward determining functions for the newly identified genes and gene families. The cDNAs identified by EST analysis can be used in standard experiments designed to elucidate their cellular and developmental functions. Because so many novel genes have been and are being identified, some researchers are looking for more rapid ways to analyze the function or expression of large numbers of genes. A recent innovation is the use of cDNA (16) or oligonucleotide (17) microarrays that represent large numbers of genes or whole genomes. Hybridization to microarrays of labeled mRNA samples isolated from different tissues, mutant backgrounds, or developmental or growth states allows analysis of changes in gene expression for a large number of genes simultaneously. Other researchers have developed a system for monitoring the effects of gene disruption on growth in yeast for many genes simultaneously (18). The development of additional techniques for global analysis of gene expression and function will be essential for rapidly characterizing the wealth of new genes identified by EST studies.
