overlap transcripts from other genes, in both the sense and
the antisense direction. Often 5′ ends of genes are found
very distant from the body of the gene, overlapping exons
from other distal genes [1,59].
The 20 687 protein-coding loci in GENCODE encode
122 099 transcripts. Therefore, there are on average six
alternative transcripts per protein-coding locus. Most
transcripts at protein-coding genes (>98%) are multi-exonic.
On average there are 8.1 exons per transcript. Surprisingly,
only 76 052 (62%) of the transcripts at protein-coding genes
appear to code for proteins (on average 3.7 coding
transcripts per locus). The role of the non-coding
transcripts associated with protein-coding loci is mostly
unknown, although evidence of a regulatory role is known for
some of them. For instance, Sox2 is a transcription factor
and plays a key role in the maintenance of the
undifferentiated state of embryonic and adult neural stem
cells. The Sox2 gene encodes a long non-coding RNA that
shares the same transcriptional orientation as Sox2 [60]. The
Sox2 lncRNA is expressed in the neurogenic region of the
adult mouse brain, and is dynamically regulated during
vertebrate development of the central nervous system,
implying a role in regulating self-renewal and neurogenesis
of neural stem cells.
Furthermore, within a given locus different protein-coding
transcripts often differ only in their untranslated
sequences. Therefore, the effective number of different
protein sequences encoded at protein-coding loci is less
than two on average [9]. Moreover, many of these protein
variants appear to contain truncated functional domains
having markedly different structures and functions from
their more constitutively spliced counterparts [61]. For the
vast majority of these alternative isoforms, little evidence
exists to suggest they have a role as functional proteins.
Therefore, only a small fraction of the enormous complexity
observed at the transcriptional level appears to be
translated into protein complexity.
The Long Non-coding RNA Transcriptome
Although they are poorly understood, long non-coding RNAs
(lncRNAs) are emerging as central players in cell biology.
LncRNAs are long, multi-exonic transcripts, often
polyadenylated, and the loci that encode them exhibit the
epigenetic marks typical of transcribed regions [10]. Over
the last two decades the role of a small number of lncRNAs
has been extensively researched in key epigenetic circuits.
The 19 kb XIST RNA is necessary and sufficient for the
silencing in cis of an entire X-chromosome in placental
mammals [62]. In genomic imprinting at least three lncRNAs
are known to stably silence specific parental alleles
through both cis and trans mechanisms (Air [63]; H19 [63];
Kcnq1ot1 [64]). All of these have been shown to interact
with known epigenetic regulatory proteins to regulate gene
expression.
The discovery and study of lncRNAs is of significant
relevance to human biology and disease, as they represent
a huge, largely unexplored, and functional component of the
genome [65,66]. It has been proposed that these may explain
human-specific traits [67,68] and, more practically, may
underlie the deficiencies in rodent models of human
diseases. There is diverse evidence that lncRNAs are
intimately involved in gene networks underlying cancer: an
antisense lncRNA that is overexpressed in leukemia
represses expression of the p15 tumor suppressor [69]. Also,
in genome-wide association studies the long non-coding
RNA CDKN2B-AS1 (also called ANRIL) has been associated
with diverse diseases such as diabetes, glioma and basal
cell carcinoma [70]. Elsewhere, an intergenic lncRNA,
linc-p21, functions as a downstream effector of the p53 tumor
suppressor [71]. And MEG3 activates p53 through an
unknown mechanism [72]. Evidence is also mounting that
numerous neurological diseases involve components of
toxic RNA gain-of-function mutations (particularly
trinucleotide repeat disorders) [73], or involve
misregulation of coding genes by antisense transcripts [74].
Given the lack of lncRNA annotation in the human genome
until very recently, it is likely that many 'intergenic'
disease-associated loci discovered in genome-wide
association studies in fact modify the regulation or
function of lncRNAs.
In the early 2000s the FANTOM consortium pioneered
the genome-wide discovery of lncRNAs in mouse,
publishing a set of 34 030 lncRNAs based on cDNA
sequencing [75]. Recently, a catalogue of 5446 human
lncRNAs has been created by Jia et al. [76] based on
a computational pipeline of sequenced cDNAs. Meanwhile,
the large intervening non-coding RNAs ('lincRNAs') [10],
discovered through epigenetic annotation of human and
mouse genomes, represent a useful set of RNAs but omit
the many lncRNAs that reside within or overlap
protein-coding loci. The GENCODE consortium within the
ENCODE project has for several years been manually
annotating a comprehensive set of human lncRNAs. Early
releases of the GENCODE annotation have already been
used to investigate the potential function of these
transcripts (see, for instance, Ørom et al. [77]). GENCODE
constitutes the most exhaustive collection of human
lncRNAs available to date. Version 7 includes 9640 long
non-coding RNA loci producing 14 880 transcripts.
A number of large-scale analyses of the GENCODE and
other lncRNA collections [78,79] have revealed a number
of features characteristic of this class of RNAs. LncRNAs
are generated through pathways similar to those of
protein-coding genes. They appear to be under similar
transcriptional regulatory control, as their promoters are
marked by the same set of histone modifications (e.g.
histone H3 lysine 4 methylation), and the majority are
spliced by