Biology Reference
In-Depth Information
quality, as unsequenced portions as well as assembly errors can lead to
missing or truncated gene models. Also, the information content along
the genome can be extremely variable, so that pseudogenes, duplications,
short tandem repeats, and the overall complexity of a region
can mislead the software. Many misannotated gaps inside the gene body,
as shown in Fig. 10.1C, can further reduce the efficiency of gene
prediction. Furthermore, predicted gene models are often limited to the
CDS, and miss UTRs which, as previously stated, often contain important
regulatory elements. In the X. tropicalis genome, some key genes involved in
metamorphosis or TH signaling are poorly annotated, which can hamper
downstream analysis of functional data. For example, based on the current
annotation, it would be impossible to find the T3RE located at the TSS of
TRβ by looking for it near the 5′ end of the gene (a very common practice,
although many binding sites may be missed as they are located away from the
TSS, see below). In fact, the 5′ end of the TRβ gene was determined by 5′
RACE PCR and is correctly annotated only in the manually curated RefSeq
database. It is located 210 kb upstream of the XenBase and Ensembl models
(Fig. 10.2B).
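The effect of a misplaced 5′ end on a TSS-centered search can be sketched as follows. This is a minimal illustration only: the coordinates, the 5 kb flank, and the names `GeneModel` and `site_in_tss_window` are assumptions made for the example, not values or tools from the text. The only figure taken from the text is the 210 kb offset between the curated 5′ end and the automated models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """A gene model reduced to its genomic span (hypothetical structure)."""
    name: str
    strand: str  # "+" or "-"
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate

    @property
    def tss(self) -> int:
        # The 5' end is the leftmost coordinate on "+" and the rightmost on "-".
        return self.start if self.strand == "+" else self.end


def site_in_tss_window(model: GeneModel, site_pos: int, flank: int = 5_000) -> bool:
    """Return True if a binding site lies within +/- flank bp of the model's TSS."""
    return abs(site_pos - model.tss) <= flank


# Hypothetical coordinates: the curated 5' end sits 210 kb upstream of the
# 5' end of the automated model, as described in the text.
CURATED = GeneModel("TRb (curated 5' end)", "+", 790_000, 1_100_000)
AUTOMATED = GeneModel("TRb (automated model)", "+", 1_000_000, 1_100_000)
T3RE_POS = 790_050  # hypothetical position of a T3RE at the true TSS

for model in (CURATED, AUTOMATED):
    print(model.name, site_in_tss_window(model, T3RE_POS))
```

With these assumed numbers, the site is recovered only when the search is anchored on the curated 5′ end. Widening the flank to compensate would have to exceed 210 kb, which defeats the purpose of a promoter-proximal search.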
Gene models can be determined by full cDNA sequencing (Klein et al.,
2002; Voigt, Chen, Gilchrist, Amaya, & Papalopulu, 2005). Although this
method is quite expensive and slow, it is also very accurate. Individual cDNA
clones are assembled from multiple single-pass sequenced reads, ensuring full
coverage of the cloned molecules. When combined, gene prediction and cDNA sequencing
can yield accurate models, although extensive curation is still required. While
there are an expected 22,000 protein-coding genes, only half are actually
linked to accession numbers pointing to cDNA in the RefSeq database,
and only one-third are linked to full-length cDNA (Gilchrist, 2012).
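To put these fractions in absolute terms, the arithmetic can be sketched as below. The gene count and the two fractions come from the text. The variable names are illustrative, and the sketch assumes that both fractions refer to the full expected gene set, so the results are rough estimates rather than reported counts.

```python
# Expected number of protein-coding genes, as stated in the text.
EXPECTED_GENES = 22_000

# Assumption: both fractions are taken over the full expected gene set.
with_cdna = EXPECTED_GENES // 2                  # "half" linked to any cDNA accession
with_full_length = round(EXPECTED_GENES / 3)     # "one-third" linked to full-length cDNA
without_full_length = EXPECTED_GENES - with_full_length

print(f"linked to cDNA:              ~{with_cdna:,}")
print(f"linked to full-length cDNA:  ~{with_full_length:,}")
print(f"lacking full-length support: ~{without_full_length:,}")
```

Under that assumption, roughly two-thirds of the expected genes lack full-length cDNA support, which is why extensive curation is still required.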
The lack of a clear definition of gene boundaries makes the analysis of
coordinated variation in gene expression difficult. In fact, genome-wide
measurement of differential gene expression has traditionally been carried out
using microarrays, whose design relies heavily on a good annotation
(Altmann et al., 2001). In this technology, fluorescently labeled cDNA is
hybridized to probes fixed on a solid surface. The readout of the signal for a
given spot then depends on the relative amount of cDNA present in the
sample. On the chip, transcript-specific probes define probesets, with the
redundancy of the individual probes allowing the assay to be more specific
and sensitive. The design of microarrays thus requires a comprehensive
annotation, and the quality and exhaustiveness of the chip will greatly depend
on it. In model organisms (mouse, Drosophila, etc.), microarrays have been