Biology Reference
In-Depth Information
Precise definition of gene boundaries is of utmost importance when trying to
link coordinated variation of gene expression with the specific recruitment
of transcription factors or modification of the chromatin architecture. In
particular, most current gene models capture only the coding sequence
(CDS), without taking into account the untranslated regions (UTRs). The
upstream boundary of the 5′ UTR defines the actual transcription start site
(TSS) and the proximal promoter region, with which many regulatory elements
required for transcriptional regulation are associated. The same is true for
3′ UTRs, which are underrepresented in current gene models and often
contain numerous binding sites for microRNAs, which may be important
posttranscriptional regulators.
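
To make the effect of a missing 5′ UTR concrete, the short Python sketch below derives the TSS and a proximal promoter window from a gene model on either strand. It is an illustration only, not a method described in the text: the `GeneModel` class, its field names, the 500-bp window, and all coordinates are hypothetical. Coordinates are assumed to be 1-based and inclusive, and the 5′ UTR is assumed to be contiguous (no intron).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GeneModel:
    """Hypothetical gene model; coordinates are 1-based and inclusive."""
    name: str
    strand: str                          # "+" or "-"
    cds_start: int                       # leftmost CDS coordinate
    cds_end: int                         # rightmost CDS coordinate
    utr5_length: Optional[int] = None    # None when no 5' UTR is annotated


def transcription_start_site(gene: GeneModel) -> int:
    """Return the TSS; falls back to the CDS boundary if no 5' UTR is known."""
    utr = gene.utr5_length or 0
    if gene.strand == "+":
        return gene.cds_start - utr
    return gene.cds_end + utr


def proximal_promoter(gene: GeneModel, window: int = 500) -> Tuple[int, int]:
    """Return (start, end) of the window immediately upstream of the TSS."""
    tss = transcription_start_site(gene)
    if gene.strand == "+":
        return (max(1, tss - window), tss - 1)
    return (tss + 1, tss + window)


# Same hypothetical gene, annotated with and without its 5' UTR.
cds_only = GeneModel("geneA", "+", 10_000, 12_000)
with_utr = GeneModel("geneA", "+", 10_000, 12_000, utr5_length=800)
```

With these made-up coordinates, `proximal_promoter(cds_only)` returns `(9500, 9999)` while `proximal_promoter(with_utr)` returns `(8700, 9199)`: a CDS-only model places the presumed promoter entirely inside what is actually the 5′ UTR.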
Traditionally, the definition of gene models relies on a collection of
software tools (ab initio gene prediction, comparison to databases, etc.) and
functional data (ESTs, full-length cDNAs, etc.). Between 2000 and 2007, the
number of ESTs for X. tropicalis increased tremendously, reaching more
than 1.2 million ESTs (grouped by unigene library entry, 5 million
altogether), making it one of the model organisms with the highest number of
ESTs. The ESTs were clustered and aligned to the genome sequence in order to
define potential exons (Nagaraj, Gasser, & Ranganathan, 2007). Although this
strategy can be quite effective when large collections of ESTs are used, the
process remains error prone (depending on the quality of the collections)
and often lacks the resolution required to describe alternative transcripts
(Nagaraj et al., 2007). As can be seen in Fig. 10.2B, the gene TRβ is
associated with a single EST cluster far upstream of the annotated TSS in
Ensembl or XenBase, while only the RefSeq annotation reflects boundaries that
actually encompass the EST cluster.
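
The idea of grouping aligned ESTs into clusters and checking them against annotated gene boundaries can be sketched as follows. This is a deliberately simplified illustration, not the pipeline used by the authors or by Nagaraj et al.: real EST clustering relies on sequence similarity and spliced alignment, whereas this sketch only merges overlapping genomic intervals. The function names and all coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def cluster_alignments(alignments: List[Interval]) -> List[Interval]:
    """Merge overlapping EST alignment intervals into clusters."""
    clusters: List[Interval] = []
    for start, end in sorted(alignments):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the previous cluster: extend that cluster.
            prev_start, prev_end = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: open a new cluster.
            clusters.append((start, end))
    return clusters


def encompasses(gene: Interval, cluster: Interval) -> bool:
    """Return True if the gene model fully contains the EST cluster."""
    return gene[0] <= cluster[0] and cluster[1] <= gene[1]


# Hypothetical EST alignments on one scaffold.
est_alignments = [(100, 250), (200, 400), (1500, 1800), (390, 420)]
clusters = cluster_alignments(est_alignments)
```

Here `clusters` is `[(100, 420), (1500, 1800)]`. A gene model spanning `(1400, 5000)` does not encompass the upstream cluster `(100, 420)`, whereas one spanning `(50, 5000)` does, which mirrors the kind of disagreement between annotations described above.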
Gene models can be inferred from the genomic sequence using computational
methods. Popular gene prediction software includes GeneWise
(Birney, Clamp, & Durbin, 2004), GeneMark, FGENESH (Salamov &
Solovyev, 2000), and EUGENE (Schiex, Moisan, & Rouzé, 2001), to name
a few. This process suffers from several pitfalls and always benefits from
multiple lines of evidence. In addition, the assembled genome must be of
relatively good

[Figure caption, continued] TRβ transcription cannot be measured with
Affymetrix arrays. (C) Illustrative example of defunct Affymetrix probesets.
The defunct probesets are composed of probes that are poorly specific for a
given gene/transcript. The first track corresponds to the XenBase gene
models as in (A). The second track corresponds to the genomic region spanned
by the curated probesets that can be used for analysis, while the third track
(defunct_probesets) corresponds to the genomic region spanned by defunct
probesets. The region shown is located on scaffold_464.