generated and thousands of microarray experiments have
been performed, monitoring gene expression in many
different cell types, individuals, conditions, etc., providing
an extremely rich characterization of the transcriptional
diversity and complexity of the human genome. It may
therefore come as a surprise that the exact number of genes
in the human genome is not yet precisely known, much less
the number of transcripts encoded by these genes. Prior to
the publication of the human genome sequence, the number
of human genes had been estimated to be in the range of
70 000–100 000 [39]. These numbers were mostly based on
the experimental estimation of the number of CpG islands
that exist in the human genome. CpG islands are genomic
regions that contain a high frequency of CG dinucleotides.
They are found in and near approximately 60% of human
promoters [40], and through methylation of the cytosine
they may play a role in the regulation of gene expression.
Their distinctive properties allow their physical
separation from bulk DNA. In this way, Antequera and Bird
estimated approximately 45 000 CpG islands in the human
genome, which produced a rough estimate of about 80 000
human genes.
This number was widely accepted by the scientific
community. Therefore, one of the surprises of the
publication of the draft human genome sequences was that
the number of human genes was estimated to be considerably
lower. Indeed, Lander et al. [41] estimated the number of
protein-coding genes to be in the range of 30 000–40 000,
and Venter et al. [42] estimated it to be in the range of
27 000–38 000. The publication of the human genome draft
sequence did not end the dispute: just before and after
its publication, widely discrepant estimates, ranging from
30 000 to 120 000 genes, were published [43–45]. Since
then, some consensus appears to have been reached, and
although the exact number of human genes is not yet known,
there is agreement that the number of protein-coding genes
in the human genome is likely to be between 20 000 and
21 000 [46]. However, evidence has since emerged for an
unexpectedly large number of long non-coding RNAs
(lncRNAs). Whereas a few years ago only a few dozen
examples were known, currently about 10 000 long
non-coding RNA genes are annotated on the human genome.
This number, however, is likely to be an underestimate,
and the overall number of human genes (coding and
non-coding, long and small) may in the end be not too
distant from that in the earlier estimates.
Human Genome Reference Gene Sets
Since the publication of the draft human genome sequence
in 2001 [47,48], a number of human gene reference sets
have been created using computational prediction, manual
annotation, or a hybrid of the two methods. The Ensembl
project was initially set up to warehouse and annotate the
large amount of unfinished genomic data being produced as
part of the public Human Genome Project, as well as to
provide browser capacity for both sequences and
annotations. Ensembl has since expanded and now generates
automatic predictions for over 35 species. The Ensembl
gene build process is based on alignments of protein and
cDNA sequences to produce a highly accurate gene set with
few false positives [49]. For human, Ensembl has recently
merged with the GENCODE project (see below).
Another site supplying sequence and annotation data for
a large number of genomes is the University of California,
Santa Cruz (UCSC) genome browser database [50]. The UCSC
Genes track includes both protein-coding genes and
non-coding RNA genes. Both types of gene can produce
non-coding transcripts, but non-coding RNA genes do not
produce protein-coding transcripts. The track is a
moderately conservative set of predictions. Transcripts of
protein-coding genes require the support of one RefSeq
RNA, or one GenBank RNA sequence plus at least one
additional line of evidence. The latest release (February
2012) of UCSC Genes includes 80 922 transcripts
corresponding to 31 227 genes.
Manual annotation still plays a significant role in
annotating high-quality finished genomes. Currently the
National Center for Biotechnology Information (NCBI)
reference sequences (RefSeq) collection provides an
extensively manually curated resource of multispecies
transcripts and includes plant, viral, vertebrate and
invertebrate sequences [51,52]. These are, as their name
indicates, transcript-orientated and usually rely on
full-length cDNAs for reliable curation, although the
dataset also contains predictions using ESTs and partial
cDNAs aligned against genomic sequence using the Gnomon
prediction program. RefSeq is a very reliable but also
conservative gene reference set. Other reference sets
usually include RefSeq, but extend it substantially.
Compared to RefSeq, the UCSC gene set, for instance, has
about 10% more protein-coding genes, approximately four
times as many putative non-coding genes, and about twice
as many splice variants.
The Havana group at the Wellcome Trust Sanger Institute
produces its annotation on vertebrate genomes by mapping
transcript evidence (mostly from known cDNA sequences)
onto the genome, and manually curating the resulting
alignments. Currently only three vertebrate genomes
(human, mouse and zebrafish) are sequenced to a quality
that merits manual annotation [53]. In the case of the
human genome, the work by the Havana team is at the core
of the GENCODE annotation produced [9] within the
framework of the ENCODE project [55,56]. The GENCODE
annotation extends the Havana manually curated transcript
models with computational predictions that are
experimentally validated. GENCODE is the most
comprehensive human genome gene set to date, and is
becoming the standard reference in many large-scale
genome projects,