Biology Reference
In-Depth Information
are rich in coding sequences compared to genome segments lacking high GC
frequencies ( Jabbari and Bernardi 2000 ). The effectiveness of gene-finding
programs is based on the type of information used by the program and the
algorithm used to combine that information into a coherent prediction. Three
types of information are used to predict the location of genes: 1) “signals in the
sequence” such as splice-sites, 2) “content” statistics such as codon bias, and 3)
similarity to known genes ( Stormo 2000 ). Start and stop codons can be useful in
predicting exons, but they are uninformative if the reading frame is unknown.
Some programs look for sites associated with promoters such as
TATA boxes, transcription-factor binding sites, and CpG islands. Poly(A) signals
are used to aid in identifying the 3' end of the gene. As the number
of known coding sequences increases, the accuracy of gene-prediction programs
improves because the larger sample of known genes allows more reliable
statistical measures and makes it much more likely that a newly sequenced
gene is related to one that has been identified previously.
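The point about start and stop codons and the unknown reading frame can be illustrated with a minimal Python sketch. It is not how any particular gene-finding program works; it is a naive scan, under the assumption that an open reading frame is simply an ATG followed in the same frame by a stop codon, with no introns. The function names (reverse_complement, gc_fraction, orfs_in_frame, six_frame_orfs) and the min_codons parameter are hypothetical and chosen only for this illustration. Because the frame is unknown, the scan has to be repeated for all three frames on both strands. A simple GC fraction is included as an example of a "content" statistic.

```python
# Illustrative sketch only: a naive open-reading-frame (ORF) scan.
# Real gene finders combine signals, content statistics, and similarity.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of bases that are G or C (a simple 'content' statistic)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def orfs_in_frame(seq, frame, min_codons=3):
    """Yield (start, end) for ATG...stop ORFs in one frame (0, 1, or 2).

    'end' is exclusive and includes the stop codon. 'min_codons' is a
    hypothetical length filter counting the start and stop codons.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Scan all three frames on both strands, since the frame is unknown.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end))
    return results


if __name__ == "__main__":
    example = "CCATGAAATAGCC"  # made-up sequence for illustration
    for strand, frame, start, end in six_frame_orfs(example):
        print(strand, frame, start, end)
```

In the made-up example sequence, the only ATG that is followed by an in-frame stop codon lies in frame 2 of the forward strand, so a scan restricted to frame 0 or 1 would miss it.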
Large genomic projects can only be analyzed computationally; continued
improvements in analysis and annotation methods are needed ( Ashburner 2000,
Pop and Salzberg 2008, Alkan et al. 2011 ). Advances have been made in identifying
DNA sequences as coding or noncoding. Although current methods leave
uncertainties, an exact coding prediction is not always necessary. Even partially
correct predictions can focus experiments so that the true gene structure is determined
faster than would be possible if these predictions were unavailable. Continued
advances in computational and experimental methods for identifying genes,
their regulatory elements, and function are expected ( Stormo 2000, Baxevanis
and Ouellette 2001, Chen and Tompa 2010, Miller et al. 2010 ).
Yandell and Ence (2012) state, “Sequencing costs have fallen so dramatically
that a single laboratory can now afford to sequence large, even human-sized
genomes. Ironically, although sequencing has become easy, in many ways,
genome annotation has become more challenging.” The reasons for this include
the shorter read lengths of the NextGen sequencing platforms, which leave
assemblies fragmented into contigs that rarely cover the whole genome. Also,
as nonmodel organisms are sequenced, it becomes harder to identify genes in
these novel species, especially when as many as one-third of the genes could
be new, so-called orphan genes. Another issue is that annotation projects
now involve scientists with little training in bioinformatics and computational
biology. Unfortunately, genome annotation is not yet a “point-and-click” process.
However, Yandell and Ence (2012) provide a helpful beginner's guide to genome
annotation.