being analyzed, Mortazavi et al. [21] estimate that 1 RPKM
corresponds to 1 to 3 transcripts per cell. Given the nature of
the data produced by RNASeq experiments (millions of very
short sequence fragments), both transcript reconstruction and
quantification are challenging and have triggered active
research in computational biology. A number of tools are
currently available to perform such tasks (see next section),
but their accuracy and reliability have not yet been properly
benchmarked.

Computational Methods
The contribution of computational methods is essential to infer
the gene and transcript set characteristic of a given genome.
First, data produced by technologies to globally monitor
transcriptomes, such as those reviewed in the previous section,
cannot be processed without sophisticated computational
tools. Second, experimental approaches provide information
only on the location and exonic structure of the genes
expressed in the conditions that have been surveyed. Unless
a very large panel of heterogeneous conditions has been
monitored, transcribed sequences detected in this way capture
only a fraction of the reference transcriptome.

Current computational methods use a variety of
heterogeneous sources of information, which are processed
and integrated through complex computational and
statistical models (Figure 2.1). There are three broad
sources of information that can be used to pinpoint the
location and exonic structure of genes in genomic
sequences.

• Sequence comparisons across genomes (panel 1 in
  Figure 2.1). Functional regions in genomes are subject
  to natural selection, and are therefore more conserved
  through evolution than non-functional ones. Regions that
  code for proteins are among the most conserved through
  evolution. Moreover, they have a specific mode of
  conservation, which can be computationally detected:
  for instance, because of the degeneracy of the genetic
  code, in coding regions the third codon position is usually
  less conserved than the first and second positions.
  Conservation is also a function of evolutionary distance.
  Conservation in genomes of closely related species
  extends beyond functional regions; on the other hand, it
  may have already vanished in these regions when
  distantly related species are compared.
• Intrinsic sequence features in the genome. There are
  sequence features in the genomic sequence that
  reveal the existence of genes. These features are of
  two types: intrinsic sequence signals involved in gene
  specification (panel 2), and statistical bias in the genome
  sequence specific to protein-coding regions (panel 3).
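The codon-position pattern described under sequence comparisons can be illustrated with a minimal sketch. It assumes two coding sequences that are already aligned, gap-free, and in frame; the function name and the toy sequences are hypothetical and chosen only for illustration. Real comparative gene finders work on genome alignments and use far richer models.

```python
from typing import List


def identity_by_codon_position(seq_a: str, seq_b: str) -> List[float]:
    """Return the fraction of identical bases at codon positions 1, 2 and 3.

    Assumption: both sequences are aligned, gap-free, in frame, and of
    equal length (a non-zero multiple of three).
    """
    if not seq_a or len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("need equal-length, non-empty sequences (multiple of 3)")
    matches = [0, 0, 0]
    n_codons = len(seq_a) // 3
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a == b:
            matches[i % 3] += 1  # i % 3 is the position within the codon
    return [m / n_codons for m in matches]


# Hypothetical toy alignment: both sequences encode Met-Ala-Leu-Gly-Lys,
# but they differ at synonymous third codon positions.
SPECIES_A = "ATGGCTCTGGGAAAA"
SPECIES_B = "ATGGCCCTTGGGAAG"

if __name__ == "__main__":
    first, second, third = identity_by_codon_position(SPECIES_A, SPECIES_B)
    print(f"identity at codon positions 1/2/3: {first:.2f} {second:.2f} {third:.2f}")
```

In this toy case the first and second positions are fully conserved while the third is not, which is the signature of coding conservation the text refers to. A non-coding region under no such constraint would be expected to show no systematic difference between the three positions.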
FIGURE 2.1 Methods to determine reference transcriptomes. See text for details. The yellow portion in the input cDNA sequences represents the
UTRs. Vertical lines in cDNA sequences as well as in input protein sequences represent the location of the exons. Arrows indicate methods. Dashed lines
indicate methods that depend on previously constructed gene models. Adapted from Harrow et al. [83].
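The statistical bias specific to protein-coding regions (panel 3) can be sketched as a codon-usage log-odds score. This is only one simple way to capture such a bias, not the method of any particular gene finder: it assumes a uniform background in which all 64 codons are equally likely, and the training sequences, query, and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codons(seq: str, frame: int = 0) -> List[str]:
    """Split a sequence into complete codons, starting at the given frame."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def train_codon_log_odds(coding_seqs: List[str],
                         pseudocount: float = 1.0) -> Dict[str, float]:
    """Log-odds of each codon in known coding sequences versus a uniform
    background (1/64 per codon). The pseudocount avoids log(0)."""
    counts = Counter(c for s in coding_seqs for c in codons(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * 64
    return {c: math.log(((counts[c] + pseudocount) / total) * 64)
            for c in ALL_CODONS}


def frame_scores(seq: str, log_odds: Dict[str, float]) -> List[float]:
    """Sum codon log-odds in each of the three forward reading frames.
    Codons with ambiguous bases contribute zero."""
    return [sum(log_odds.get(c, 0.0) for c in codons(seq, f))
            for f in range(3)]


# Hypothetical in-frame training sequences and query.
TRAINING = ["ATGGCTCTGGGAAAA", "ATGCTGGCTCTGAAAGGA"]
QUERY = "ATGGCTAAACTGGGA"

if __name__ == "__main__":
    model = train_codon_log_odds(TRAINING)
    scores = frame_scores(QUERY, model)
    best = scores.index(max(scores))
    print("frame scores:", [round(s, 2) for s in scores], "best frame:", best)
```

The frame whose codons resemble the training set scores highest, so the score acts as a crude indicator of coding potential and of the reading frame. With realistic data the model would be trained on many annotated coding sequences, and the background would be estimated from non-coding sequence rather than assumed uniform.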