programs are often referred to as 'ab initio' or 'de novo'
gene finders. They are the programs of choice in the
absence of known transcript and protein sequences for the
interrogated genome, and of other phylogenetically
related genomes. If the sequence of related genomes is
available, programs exist that combine this intrinsic
information with the patterns of genomic sequence
conservation. These programs are often referred to as
comparative (or dual- or multi-genome) gene finders
(panel 7). The most sophisticated of such programs use an
underlying phylogenetic tree to appropriately weight
sequence conservation depending on evolutionary distance
[26-29].
If cDNA, ESTs or RNASeq data are available, these
often take priority over other existing sources of
information. In these cases, the initial map of the transcript or
protein sequences onto the genome can be obtained using
a variety of popular tools, such as BLAT [30] or BLAST
[31], or other tools devoted specifically to the mapping of
short sequence reads. This map can be refined using more
sophisticated 'splice alignment' algorithms, whose explicit
splice site models allow more precise alignment across
gaps corresponding to introns (panel 10). In the figure, we
show a protein-to-genome alignment produced by the
GENEWISE [32] program. Each amino acid in the protein
sequence aligns with a codon, and the large intron gaps are
delimited by the canonical GT-AG splice dinucleotides.
Alternatively, cDNA and protein information can be fed
into an ab initio gene finder algorithm to inform the exons
included in the prediction (panel 10). RNASeq reads can
also be used. A number of methods have been developed to
delineate gene structures from RNASeq reads and to
quantify transcript abundances [22,33,34]. Two general
approaches are possible. Reads can be assembled before
mapping them to the genome (panel 5). In this way, contigs
similar to ESTs can be created, and pipelines to deal with
ESTs can then be used. Alternatively, reads can be directly
mapped to the genome (if available for the species under
investigation), and used to inform gene prediction
programs (panel 6).
Often cDNA and protein evidence for a given genome is
only partial; in such cases the initial reliable gene and
transcript set may be extended with more hypothetical
models derived from ab initio or comparative gene finders,
or from the genome mapping of cDNA and protein
sequences from other species. Pipelines have been developed
that automate this multistep process (panel 9). More
recently, programs have been developed that combine the
output of many individual gene finders (panel 11). The
underlying assumption in these 'combiners' is that
consensus across programs increases the likelihood that a
prediction is correct. Thus, the predictions are weighted according
to the particular features of the program producing them.
The most general of such frameworks allow for the
integration of a great variety of types of predictions, which
include not only gene predictions, but also predictions of
individual sites and exons. In spite of all development
efforts in computational gene finding, the most reliable and
complete gene annotations are still obtained after the initial
alignments of cDNA and proteins with the genome
sequence have been inspected manually to establish the
exon boundaries of genes and transcripts (panel 13). This is
the task carried out by the HAVANA team at the Sanger
Institute. The initial manual annotation can be even further
refined by subsequent experimental verification of those
transcript models lacking sufficiently strong evidence, such
as in the GENCODE project (panel 14, and see below).
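The weighting idea behind the 'combiners' described above can be sketched as a weighted vote over exon predictions. This is a minimal illustration only, not the algorithm of any published combiner: the program names (finder_a, finder_b, finder_c), the weights, the threshold and the function name combine_exons are all hypothetical, and exons are simplified to (start, end, strand) tuples.

```python
from collections import defaultdict

def combine_exons(predictions, weights, threshold=0.5):
    """Keep exons whose weighted support fraction reaches `threshold`.

    predictions: dict mapping program name -> iterable of (start, end, strand)
    weights:     dict mapping program name -> non-negative weight
    """
    total = sum(weights[program] for program in predictions)
    support = defaultdict(float)
    for program, exons in predictions.items():
        for exon in set(exons):  # count each program at most once per exon
            support[exon] += weights[program]
    return sorted(exon for exon, s in support.items() if s / total >= threshold)

# Hypothetical output of three gene finders on the same sequence.
predictions = {
    "finder_a": [(100, 200, "+"), (300, 400, "+")],
    "finder_b": [(100, 200, "+"), (300, 410, "+")],
    "finder_c": [(100, 200, "+"), (300, 400, "+"), (500, 600, "+")],
}
# Hypothetical weights, e.g. reflecting how much each program is trusted.
weights = {"finder_a": 3, "finder_b": 2, "finder_c": 1}

consensus = combine_exons(predictions, weights)
print(consensus)
```

With these made-up inputs, the exon supported by all three programs and the exon supported by finder_a and finder_c pass the threshold, while the two exons predicted by a single low-weight program do not.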
The Use of Chromatin Marks
Recently, an entirely different approach has been employed
to identify genes in genome sequences by exploiting
information on chromatin structure. Efficient
methods have been developed to create genome-wide
chromatin-state maps using chromatin immunoprecipitation
followed by massively parallel sequencing (ChIP-Seq).
These maps have revealed that genes actively transcribed
by RNA polymerase II (Pol II) are marked by trimethylation
of lysine 4 of histone H3 (H3K4me3) at their promoter,
and trimethylation of lysine 36 of histone H3 (H3K36me3)
along the length of the transcribed region. Computational
methods have been developed to identify these so-called
K4-K36 domains in genome-wide chromatin-state maps.
Using these methods, Guttman et al. discovered thousands
of novel long non-coding RNAs (lncRNAs) [10].
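As a rough illustration of what identifying a K4-K36 domain involves, the sketch below pairs H3K4me3 peaks with adjacent H3K36me3 domains on one chromosome. It is a deliberately simplified assumption-laden toy, not the method used by Guttman et al.: it presumes peaks and domains have already been called as (start, end) intervals, and the function name pair_k4_k36 and the max_gap parameter are hypothetical.

```python
def pair_k4_k36(k4_peaks, k36_domains, max_gap=2000):
    """Return candidate transcribed units as (start, end, strand) tuples.

    A K36 domain is paired with a K4 peak that overlaps a window of
    +/- max_gap around one of the domain's ends. A peak at the left end
    suggests transcription on the + strand, a peak at the right end on
    the - strand.
    """
    candidates = []
    for d_start, d_end in k36_domains:
        for p_start, p_end in k4_peaks:
            at_left = p_end >= d_start - max_gap and p_start <= d_start + max_gap
            at_right = p_end >= d_end - max_gap and p_start <= d_end + max_gap
            if at_left:
                candidates.append((min(p_start, d_start), d_end, "+"))
            elif at_right:
                candidates.append((d_start, max(p_end, d_end), "-"))
    return candidates

# Hypothetical intervals on a single chromosome.
k4_peaks = [(1000, 2000), (50000, 51000)]
k36_domains = [(2500, 20000), (80000, 95000)]

candidates = pair_k4_k36(k4_peaks, k36_domains)
print(candidates)
```

Here only the first domain has a K4 peak next to one of its ends, so a single candidate unit is reported. A real analysis would also need to handle enrichment calling, multiple chromosomes and overlap with known annotations.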
Assessing the Reference Transcriptome
Standard metrics and data sets have been developed to
benchmark the accuracy of computational gene-finding
methods [35]. Community-wide assessment projects, in
which gene predictions obtained by different groups in
a benchmark set of genomic sequences are evaluated in an
unbiased way, are also very popular in the field. In 1999 GASP
was organized to evaluate gene prediction programs in the
Drosophila melanogaster genome [36]. It was continued in
2005 with EGASP, to evaluate gene prediction programs in the
human genome [37]. Other gene-finding community
assessment projects are NGASP [38] to evaluate gene prediction in
the Caenorhabditis elegans genome, and RGASP to evaluate
methods to produce reference transcriptomes using RNASeq
data (http://www.gencodegenes.org/rgasp/).
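The text does not spell out the metrics, but a pair commonly used in gene-finding benchmarks is sensitivity and specificity computed at the exon level. The sketch below assumes that convention, in which an exon counts as correct only if both boundaries match, and in which specificity is the fraction of predicted exons that are correct (what other fields call precision). The function name and the example coordinates are hypothetical.

```python
def exon_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact-match exons.

    sensitivity = correct exons / annotated exons
    specificity = correct exons / predicted exons
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Hypothetical reference annotation and prediction, as (start, end) exons.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (500, 600)]

sn, sp = exon_level_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this made-up case two of the four annotated exons are recovered exactly, and two of the three predicted exons are correct; the exon with a shifted boundary counts against both measures.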
THE HUMAN TRANSCRIPTOME
The Number of Human Genes
The human transcriptome has been intensely investigated.
During recent decades hundreds of EST libraries have been
the