Biology Reference
In-Depth Information
bioinformatics tools, while the latter has been enabled by a number of sophisticated
and versatile experimental tools.
Computational Tools
GENE IDENTIFICATION
In nature, there exists an incredibly diverse array of species adapted to survival in all kinds
of environments. In the genomes of these organisms are the templates for countless proteins
which catalyze a myriad of chemical transformations, many of which could be useful
for synthetic biology applications. As a result, computational tools are essential to rapidly
and accurately identify the true protein-coding sequences from DNA sequence data. The
earliest gene identification algorithms were developed mainly for the analysis of shorter
DNA sequences in which the exact coding sequence of a protein was ambiguous. These
methods were relatively simple, but provided fairly accurate predictions of 'coding' versus
'noncoding' sequences. For example, the TESTCODE algorithm devised in 1982
misclassified only 5% of test sequences, but drew no conclusion for 20% of test sequences. 2
Subsequent prediction algorithms employed more sophisticated approaches to achieve
better results. For example, the GeneMark program of Borodovsky and McIninch
(initially referred to as GENMARK) combined nonhomogeneous Markov chain models with
Bayesian decision-making for coding sequence prediction. 3 This program also introduced
simultaneous analysis of both DNA strands as a method of improving accuracy. As the
sequencing of entire genomes became a reality, the need for reliable gene prediction became
more pressing. To improve the GeneMark program for entire bacterial genomes, a hidden
Markov model framework was implemented, as well as recognition of ribosome binding
site sequences. 4 Further improvements came with the application of self-training for
new prokaryotic genome sequences, 5 and expansion to eukaryotic and viral systems. 6,7
GLIMMER represents a complementary tool for gene identification that was built on
interpolated Markov models. 8,9 This tool has similarly been adapted to eukaryotic DNA, 10,11
as well as endosymbiont and metagenome DNA. 12,13 These tools and others continue
to be indispensable in the identification of new and potentially interesting protein-coding
sequences from the ever-expanding volume of DNA sequence information available.
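The Markov chain idea behind these predictors can be illustrated with a minimal Python sketch. It is not the GeneMark or GLIMMER implementation: it assumes a first-order chain, add-one smoothing, uppercase A/C/G/T input, and toy training sequences invented for the example, and it ignores the initial-state probability. The function names (`train_markov`, `log_likelihood`, `coding_log_odds`) are hypothetical. The coding model keeps one transition table per codon position (a simple nonhomogeneous chain), the noncoding model keeps a single table, and a window is scored on both strands and in all three frames.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def train_markov(seqs, periodic):
    """Return first-order transition log-probabilities with add-one smoothing.

    If periodic is True, one table is kept per codon position (a 3-periodic,
    nonhomogeneous chain); otherwise a single table is shared by all positions.
    Training sequences are assumed to start at the first codon position.
    """
    n_phases = 3 if periodic else 1
    counts = [{a: {b: 1.0 for b in BASES} for a in BASES} for _ in range(n_phases)]
    for seq in seqs:
        for i in range(1, len(seq)):
            counts[i % n_phases][seq[i - 1]][seq[i]] += 1.0
    return [
        {a: {b: math.log(row[b] / sum(row.values())) for b in BASES}
         for a, row in table.items()}
        for table in counts
    ]


def log_likelihood(seq, model, frame=0):
    """Sum transition log-probabilities, assuming seq[0] sits at codon position `frame`."""
    n_phases = len(model)
    return sum(
        model[(i + frame) % n_phases][seq[i - 1]][seq[i]]
        for i in range(1, len(seq))
    )


def coding_log_odds(seq, coding, noncoding):
    """Best coding log-likelihood over both strands and three frames, minus noncoding."""
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    best_coding = max(
        log_likelihood(s, coding, frame) for s in (seq, rev) for frame in range(3)
    )
    return best_coding - log_likelihood(seq, noncoding)


# Toy training data, invented purely for illustration.
coding_model = train_markov(["GCT" * 8], periodic=True)
noncoding_model = train_markov(["AT" * 12], periodic=False)
score = coding_log_odds("GCTGCTGCTGCT", coding_model, noncoding_model)
```

A positive score favors the coding model and a negative score favors the noncoding model. A practical predictor would use higher-order chains, far larger training sets, and a Bayesian treatment of the competing hypotheses, as described above.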
PREDICTION OF GENE FUNCTION
Synthetic biologists are typically interested in proteins for the transformations that they
catalyze, but sequence information alone is not enough to describe a protein's utility.
Bench-top experiments both in vitro and in vivo are, of course, the best way to determine
protein function. However, the vast success in DNA sequencing and coding sequence
identification has provided such a wealth of putative protein targets that laboratory
characterization of them all is simply not feasible. Fortunately, if two proteins have similar
primary sequences, it is quite likely that they will also share similar functions. As a result,
sequence alignment and homology analysis based on proteins of known function have
proved vital to the accurate prediction of protein function from sequence data alone.
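As a rough illustration of this reasoning, the following Python sketch scores a query against a handful of annotated proteins by global alignment and borrows the annotation of the best match. It is a simplified example under stated assumptions: the scoring values (match +1, mismatch -1, gap -2), the function names, and the sequences and annotations are all hypothetical, and no real database or tool is being reproduced. Practical tools use amino acid substitution matrices and statistical significance estimates instead of raw scores.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (dynamic programming).

    Only the previous row of the scoring matrix is kept, since the score
    alone is needed here and not the alignment itself.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def predict_function(query, annotated):
    """Return (name, function, score) of the best-scoring annotated protein.

    `annotated` maps a protein name to a (sequence, function) pair.
    """
    best_name = max(
        annotated, key=lambda name: global_alignment_score(query, annotated[name][0])
    )
    seq, function = annotated[best_name]
    return best_name, function, global_alignment_score(query, seq)


# Hypothetical sequences and annotations, for illustration only.
known = {
    "protA": ("MKTAYIAKQR", "kinase"),
    "protB": ("GSHMLEDPVV", "hydrolase"),
}
name, function, score = predict_function("MKTAYIAKQK", known)
```

In practice a minimum score or identity cutoff would be applied before transferring an annotation, because a best match that is only weakly similar says little about function.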
The earliest exercises in protein homology comparisons were carried out to evaluate
evolutionary relationships rather than to predict function. 14-16 Nevertheless, these
algorithms provided the groundwork upon which subsequent protein alignment tools
would be built. In 1985, Lipman and Pearson noted that the number of available protein
sequences was increasing, and that their functions could be inferred by comparison
to other characterized proteins. As a result, they developed the FASTP algorithm for rapid
in silico comparison of a query sequence to a protein sequence database. 17 This was
followed by FASTA, which featured improved sensitivity, and LFASTA, which allowed for
analyses of local similarity. 18,19 Other tools followed, such as MSA and CLUSTAL W,
for the high-sensitivity alignment of smaller sets of proteins. 20,21
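Rapid database searches of this kind generally avoid a full alignment against every entry. One common shortcut, sketched below in Python, is to index the short words (k-tuples) of the query and count how many shared words fall on the same diagonal of the query/target comparison. This is a simplified illustration of the word-lookup idea and not the published FASTP or FASTA procedure; the function names, the default k = 2, and the example sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_ktuple_index(seq, k):
    """Map every k-letter word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal_hits(query, target, k=2):
    """Largest number of shared k-tuples lying on a single diagonal.

    A diagonal is identified by the offset (query position - target position);
    many word matches on one diagonal suggest an ungapped region of similarity.
    """
    index = build_ktuple_index(query, k)
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return max(diagonals.values(), default=0)


def rank_database(query, database, k=2):
    """Sort database entries (name -> sequence) by their best diagonal count."""
    scores = {name: best_diagonal_hits(query, seq, k) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical database entries, for illustration only.
database = {
    "hit": "GGMKTAYIAKQRGG",
    "miss": "PLVWNNHHCC",
}
ranking = rank_database("MKTAYIAKQR", database)
```

Only the top-ranked entries would then be passed to a slower, more sensitive alignment step, which is what makes this kind of search fast enough for large databases.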