The change in expectation value from 0.0006 for E(4,000) to 1.9
for E(13,000,000), calculated in the previous section, does not
mean the sequences are no longer homologous; it simply means
that, because of the large size of the database, a 40-bit
alignment score that reflects common ancestry cannot be
distinguished from the 40-bit scores that would be produced by
chance. Thus, pairwise blastp, fasta, and ssearch searches should
be performed against the smallest comprehensive database that is
likely to contain a homolog. For sequences from vertebrates, the
human protein set (30,000-40,000 entries) is likely to contain
homologs for all the sequences that can be detected. Likewise,
searches against taxonomic subsets of sequence databases will
improve sensitivity and dramatically reduce the computation
required.
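The effect of database size on the expectation value can be
sketched numerically. The snippet below assumes, as in the
E(4,000) and E(13,000,000) comparison above, that for a fixed bit
score the expectation value is proportional to the number of
database entries. The function names are illustrative and are not
part of blastp, fasta, or ssearch.

```python
import math


def rescale_evalue(evalue, old_db_size, new_db_size):
    """Rescale an E-value to another database size.

    Assumption: for a fixed alignment score, E is proportional to
    the number of entries in the database.
    """
    return evalue * new_db_size / old_db_size


def extra_bits_needed(old_db_size, new_db_size):
    """Extra bits of score needed to keep the same E-value.

    Assumption: E is proportional to 2**(-bit_score), so each
    doubling of the database costs one bit.
    """
    return math.log2(new_db_size / old_db_size)


e_small = 0.0006  # E(4,000) for the 40-bit alignment in the text

# The same alignment score in a 13,000,000-entry database.
e_large = rescale_evalue(e_small, 4_000, 13_000_000)

# The same alignment score in a 40,000-entry human protein set.
e_human = rescale_evalue(e_small, 4_000, 40_000)

print(f"E(13,000,000) is roughly {e_large:.2f}")
print(f"E(40,000) is roughly {e_human:.4f}")
print(f"extra bits needed: {extra_bits_needed(4_000, 13_000_000):.1f}")
```

Starting from the rounded value 0.0006, the rescaled value for
13,000,000 entries comes out near 1.95, in line with the 1.9
quoted above; the small difference comes from rounding.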
Protein sequence databases differ not only in their size and
redundancy but also in the quantity and quality of their
annotation. The Swiss-Prot [19] subdivision of the UniProt
Knowledgebase (UniProtKB) provides a rich set of annotations and
links to other biological databases. UniProtKB/Swiss-Prot entries
typically provide links to popular protein domain databases,
homologous structures, E.C. numbers for enzymes, and information
on functionally critical residues and sequence variation. Both
the NCBI and EMBL-EBI web sites provide searches against the
Swiss-Prot database, which currently contains about 500,000
entries.
The NCBI's refseq_protein database can also provide rich
links to other biological resources. Unlike UniProtKB/Swiss-Prot
entries, each refseq_protein sequence is linked to a refseq_mrna
entry; this allows multiple sequence alignments of refseq_protein
sequences to be converted to DNA-sequence multiple alignments,
which can be used for DNA-based and codon-based evolutionary
analyses. refseq_protein entries are also linked to the NCBI
Entrez-Gene resource, which provides links to variation,
clinical, and expression databases. Significant alignments in
searches against taxonomic subsets of refseq_protein yield rich
genetic information. Searches against the full refseq_protein
database are less sensitive, because the database is almost as
large as nr, the largest protein database offered at the NCBI.
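The conversion of a protein alignment into a DNA-sequence
alignment can be illustrated with a minimal sketch. It assumes
that each coding sequence has already been extracted from the
linked mRNA entry, starts at the first codon, and is in frame;
the function name and the toy sequences are hypothetical. The
sketch does not check that the codons actually translate to the
aligned residues, which a real pipeline should do.

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Thread an unaligned coding sequence onto one gapped protein row.

    Each residue consumes the next three nucleotides of the coding
    sequence; each protein gap becomes a three-column nucleotide gap.
    """
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == gap:
            codons.append(gap * 3)
        else:
            codon = cds[pos:pos + 3]
            if len(codon) < 3:
                raise ValueError("coding sequence is shorter than the protein row")
            codons.append(codon)
            pos += 3
    return "".join(codons)


# Hypothetical toy alignment: two protein rows and their coding sequences.
aligned_proteins = {"seq1": "MK-L", "seq2": "MKTL"}
coding_sequences = {"seq1": "ATGAAACTG", "seq2": "ATGAAAACCCTG"}

for name, row in aligned_proteins.items():
    print(name, protein_to_codon_alignment(row, coding_sequences[name]))
```

Because every protein column maps to exactly three nucleotide
columns, the resulting rows have equal length and keep codons
intact, which is what codon-based analyses require.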
Unfortunately, by default both the NCBI and EMBL-EBI web
sites offer their largest protein databases (nr at the NCBI,
UniProtKB at the EMBL-EBI) for searches. At the NCBI, the
refseq_protein database is far more informative, and the
NCBI offers organism-specific search pages that improve statistical
significance 100-fold or more. At the EMBL-EBI, the
UniProtKB/Swiss-Prot database provides the most richly
functionally annotated protein sequence dataset available; the
EMBL-EBI also offers a comprehensive set of organism-specific
sequence sets.
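The same choice of a smaller database can be made outside the web
pages. As a hedged sketch, the helper below only assembles a
command line for a locally installed NCBI BLAST+ blastp; it
assumes a BLAST+ release and a version 5 database that support
the -taxids option, and the file name query.fasta and the helper
name are hypothetical. How the reported E-values account for the
subset size should be confirmed in the BLAST+ documentation
rather than inferred from this example.

```python
import shlex


def build_blastp_command(query_fasta, database, taxid=None, evalue=0.001):
    """Assemble (but do not run) a blastp command line.

    If taxid is given, the search is limited to that NCBI taxonomy
    ID with -taxids (assumption: the installed BLAST+ and database
    version support this option).
    """
    command = [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",  # tabular output
    ]
    if taxid is not None:
        command += ["-taxids", str(taxid)]
    return command


# NCBI taxonomy ID 9606 is Homo sapiens.
human_search = build_blastp_command("query.fasta", "refseq_protein", taxid=9606)
print(shlex.join(human_search))
```

The returned list can be passed to subprocess.run once the query
file and a local copy of the database are in place.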
Searching subsets of databases —Comprehensive protein sequence
databases are very large (more than ten million sequences) and