statistical estimates. Homology boundaries can be improved by
matching the scoring matrix to the evolutionary distance of the
homologous domains.
4 Improving Search Performance
For most similarity searches, the choice of program (blastp vs.
fasta, or blastx vs. fastx) is far less important than the choice of
database to be searched. As emphasized earlier, the most important
choice an investigator can make is to search a protein sequence
database, using either blastp/fasta for protein:protein comparisons,
or blastx/fastx for translated-DNA queries against a
protein sequence database. Protein sequence searches have dramatic
advantages: (1) they provide an evolutionary look-back
time that is 5 to 10 times greater than DNA:DNA comparison; (2)
the statistical estimates from protein/translated-DNA:protein
alignments are many orders of magnitude more accurate than
DNA:DNA statistical estimates; (3) even the largest protein databases
are hundreds of times smaller than DNA datasets, and comprehensive
protein sequence searches can be done against a few
hundred thousand to a few million sequences. Twenty years ago,
there were clear homologs in DNA sequence databases that had not
yet been entered into the protein databases. This is no longer true;
protein sequences are rapidly imported from genomic DNA
sequencing projects, and the protein databases are so comprehensive
that very few proteins remain to be found only in DNA databases.
Protein sequence databases should be searched first.
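To make the idea of a translated-DNA query concrete, the following is a minimal Python sketch of a six-frame conceptual translation, the kind of step a translated search performs before comparing a DNA query to protein sequences. It is an illustration only, not the implementation used by blastx or fastx (fastx, for example, also allows for frameshifts within an alignment). The function names are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (assumed), written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; ambiguous codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Hypothetical helper: translate all three frames on both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAAGGT").items():
        print(frame, protein)
```

Each of the six translated sequences can then be compared to a protein database in the same way as an ordinary protein query, which is why translated searches share the sensitivity advantages of protein:protein comparison.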
4.1 Selecting the Database to Search
Protein databases have become very large in the post-genome decade.
As of this writing (Fall 2012), comprehensive databases like NCBI's
nr and refseq_protein databases have 13 million (refseq) to 20
million sequences; the UniProtKB/TrEMBL database (uniprot.org)
contains more than 25 million protein sequences. Much of the
increase in the most comprehensive nr, refseq, and UniProtKB/TrEMBL
databases over the past 5 years has been driven by genome
sequencing projects. As a result, these databases have become very
redundant (see footnote 6), reducing search sensitivity.
4.1.1 Statistics of Similarity Scores: Searches of Smaller Databases are More Sensitive
Homology can be inferred from statistically significant
similarity; protein sequence alignment scores expected less than
once in 1,000 searches ($E() < 10^{-3}$) are most easily explained
by inferring common ancestry. But the use of a statistical criterion
for inferring homology means that a similarity score that is
significant in some contexts may not be significant in others.
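The dependence of significance on context can be sketched numerically. The Python example below assumes one common simplification of similarity statistics: the probability of a bit score b in a single pairwise comparison of sequences of lengths m and n is taken as P(b) = 1 - exp(-m n 2^-b), and the expectation for a database search is E() = P(b) x D, where D is the number of sequences searched. The function names, sequence lengths, bit score, and database sizes are all illustrative assumptions; real programs apply additional corrections (for example, effective lengths) that are not modeled here.

```python
import math


def pairwise_probability(bits, query_len, subject_len):
    """Assumed model: P(b) = 1 - exp(-m * n * 2**-b) for one comparison."""
    return -math.expm1(-query_len * subject_len * 2.0 ** (-bits))


def expectation(bits, query_len, subject_len, db_sequences):
    """Assumed model: E() = P(b) * D, with D the number of database sequences."""
    return pairwise_probability(bits, query_len, subject_len) * db_sequences


if __name__ == "__main__":
    bits, m, n = 45.0, 300, 300  # illustrative alignment score and lengths
    for label, d in (("100,000 sequences", 100_000),
                     ("20,000,000 sequences", 20_000_000)):
        e = expectation(bits, m, n, d)
        verdict = "significant" if e < 1e-3 else "not significant"
        print(f"{label}: E() = {e:.2e} ({verdict} at E() < 1e-3)")
```

Under these assumptions, the same 45-bit alignment falls below the 10^-3 threshold in the smaller database but not in the larger one, because E() grows in direct proportion to the number of sequences searched.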
6 There are more than 2.5 million E. coli protein sequences from 200+ genomes available from the NCBI protein
databases.
 