overlap transcripts from other genes, in both the sense and
the antisense direction. Often 5′ ends of genes are found
very distant from the body of the gene, overlapping exons
from other distal genes [1,59].
The 20 687 protein-coding loci in GENCODE encode
122 099 transcripts. Therefore, there are on average six
alternative transcripts per protein-coding locus. Most
transcripts at protein-coding genes (
interact with known epigenetic regulatory proteins to
regulate gene expression.
The discovery and study of lncRNAs is of significant
relevance to human biology and disease, as they represent
a huge, largely unexplored, and functional component of the
genome [65,66]. It has been proposed that these may explain
human-specific traits [67,68] and, more practically, may
underlie the deficiencies in rodent models of human
diseases. There is diverse evidence that lncRNAs are
intimately involved in gene networks underlying cancer: an
antisense lncRNA that is overexpressed in leukemia
represses expression of the p15 tumor suppressor [69]. Also,
in genome-wide association studies the long non-coding
RNA CDKN2B-AS1 (also called ANRIL) has been associated
with diverse diseases such as diabetes, glioma and basal
cell carcinoma [70]. Elsewhere, an intergenic lncRNA,
linc-p21, functions as a downstream effector of the p53 tumor
suppressor [71], and MEG3 activates p53 through an
unknown mechanism [72]. Evidence is also mounting that
numerous neurological diseases involve components of
toxic RNA gain-of-function mutations (particularly
trinucleotide repeat disorders) [73], or involve misregulation
of coding genes by antisense transcripts [74]. Given the lack of
lncRNA annotation in the human genome until very
recently, it is likely that many 'intergenic' disease-associated
loci discovered in genome-wide association studies in
fact modify the regulation or function of lncRNAs.
In the early 2000s the FANTOM consortium pioneered
the genome-wide discovery of lncRNAs in mouse,
publishing a set of 34 030 lncRNAs based on cDNA
sequencing [75]. Recently, a catalogue of 5446 human
lncRNAs has been created by Jia et al. [76] based on
a computational pipeline of sequenced cDNAs. Meanwhile,
the large intervening non-coding RNAs ('lincRNAs') [10],
discovered through epigenetic annotation of human and
mouse genomes, represent a useful set of RNAs but omit
the many lncRNAs that reside within or overlap
protein-coding loci. The GENCODE consortium within the
ENCODE project has for several years been manually
annotating a comprehensive set of human lncRNAs. Early
releases of the GENCODE annotation have already been
used to investigate the potential function of these
transcripts (see, for instance, Ørom et al. [77]). GENCODE
constitutes the most exhaustive collection of human
lncRNAs available to date. Version 7 includes 9640 long
non-coding RNA loci producing 14 880 transcripts.
A number of large-scale analyses of the GENCODE and
other lncRNA collections [78,79] have revealed a number
of features characteristic of this class of RNAs. LncRNAs
are generated through pathways similar to those of
protein-coding genes. They appear to be under similar
transcriptional regulatory control, as their promoters are marked by
the same set of histone modifications (e.g. histone H3
lysine 4 methylation), and the majority are spliced by
98%) are multi-exonic. On average there are 8.1 exons per transcript.
Surprisingly, only 76 052 (62%) of the transcripts at
protein-coding genes appear to code for proteins (on
average 3.7 coding transcripts per locus). The role of the
non-coding transcripts associated with protein-coding loci
is mostly unknown, although evidence of a regulatory role
is known for some of them. For instance, Sox2 is a
transcription factor and plays a key role in the maintenance of
the undifferentiated state of embryonic and adult neural
stem cells. The Sox2 gene encodes a long non-coding RNA
that shares the same transcriptional orientation as Sox2 [60].
The Sox2 lncRNA is expressed in the neurogenic
region of the adult mouse brain, and is dynamically
regulated during vertebrate development of the central nervous
system, implying a role in regulating self-renewal and
neurogenesis of neural stem cells.
Furthermore, within a given locus different
protein-coding transcripts often differ only in their untranslated
sequences. Therefore, the effective number of different
protein sequences encoded at protein-coding loci is less
than two on average [9]. Moreover, many of these protein
variants appear to contain truncated functional domains
having markedly different structures and functions from
their more constitutively spliced counterparts [61]. For the
vast majority of these alternative isoforms, little evidence
exists to suggest they have a role as functional proteins.
Therefore, only a small fraction of the enormous
complexity observed at the transcriptional level appears to
be translated into protein complexity.
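
The per-locus averages quoted for protein-coding genes can be checked with simple arithmetic on the GENCODE counts given in the text. The following is a minimal sketch in Python; the variable names are illustrative assumptions, and only the three counts are taken from the text.

```python
# Counts quoted in the text for the GENCODE protein-coding annotation.
protein_coding_loci = 20_687   # protein-coding loci
total_transcripts = 122_099    # all transcripts at those loci
coding_transcripts = 76_052    # transcripts that appear to code for protein

# Average number of transcripts per protein-coding locus ("about six").
transcripts_per_locus = total_transcripts / protein_coding_loci

# Share of transcripts at protein-coding loci that appear to be coding.
coding_fraction = coding_transcripts / total_transcripts

# Average number of coding transcripts per locus.
coding_per_locus = coding_transcripts / protein_coding_loci

print(f"transcripts per locus:        {transcripts_per_locus:.1f}")
print(f"coding fraction:              {coding_fraction:.0%}")
print(f"coding transcripts per locus: {coding_per_locus:.1f}")
```

The three printed values correspond to the "six alternative transcripts", "62%" and "3.7 coding transcripts per locus" figures, up to rounding.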
The Long Non-coding RNA Transcriptome
Although they are poorly understood, long non-coding
RNAs (lncRNAs) are emerging as central players in cell
biology. LncRNAs are long, multi-exonic transcripts, often
polyadenylated, and the loci that encode them exhibit the
epigenetic marks typical of transcribed regions [10]. Over
the last two decades the role of a small number of lncRNAs
has been extensively researched in key epigenetic circuits.
The 19 kb XIST RNA is necessary and sufficient for the
silencing in cis of an entire X-chromosome in placental
mammals [62]. In genomic imprinting at least three
lncRNAs are known to stably silence specific parental
alleles through both cis and trans mechanisms (Air [63];
H19 [63]; Kcnq1ot1 [64]). All of these have been shown to