can be very redundant (the largest protein families in
refseq_protein have more than 30,000 members). Thus, searching
representative portions of sequence databases, or focussing on
taxonomically close sequences, can be far more efficient than
searching complete datasets. Both the BLAST and FASTA programs
provide options for searching database subsets. The BLAST
programs accept the -gilist option, which restricts the search to
a specified list of gi numbers within a much larger database. This
option requires that the database being searched contains gi numbers
(all the databases available from the NCBI do, but databases from the
EMBL-EBI and UniProt do not). The -gilist option provides a
powerful tool for searching sequences from selected representative
organisms, as the NCBI Entrez site makes it very easy to download
a list of sequences from an organism, or the results of an Entrez
query. For example, searching Entrez protein sequences with the
query:
srcdb_refseq[prop] AND txid9606[orgn]
provides a list of 35,615 refseq protein sequences from human
(txid9606). The sequences can be downloaded in FASTA format
from the search result page, but it is much easier to simply download
the GI List to a file, and then use blastp or blastx with the -gilist
option. The gilist option is available after any NCBI/Entrez query
into protein and DNA databases; one can just as easily generate a GI
list for all proteins with the term “glutathione transferase” in their
name.
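
The -gilist workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive pipeline: it assumes the GI List was saved from Entrez as a plain text file with one gi number per line, and the file names used below (query.fa, human_refseq.gi, hits.txt) are hypothetical. The function only builds the blastp command line; running it requires a local BLAST+ installation and a database that contains gi numbers.

```python
from pathlib import Path
from typing import List


def entrez_refseq_query(taxid: int) -> str:
    """Return the Entrez query from the text for RefSeq proteins of one taxon."""
    return f"srcdb_refseq[prop] AND txid{taxid}[orgn]"


def read_gi_list(path: str) -> List[int]:
    """Read a downloaded GI List file (assumed: one integer gi number per line)."""
    gis = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:  # ignore blank lines
            gis.append(int(line))
    return gis


def blastp_gilist_command(query: str, db: str, gilist: str, out: str) -> List[str]:
    """Build (but do not run) a blastp command restricted to the gi numbers in `gilist`."""
    return ["blastp", "-query", query, "-db", db, "-gilist", gilist, "-out", out]


if __name__ == "__main__":
    # Hypothetical file names; refseq_protein is the database named in the text.
    print(entrez_refseq_query(9606))
    print(" ".join(blastp_gilist_command(
        "query.fa", "refseq_protein", "human_refseq.gi", "hits.txt")))
```

The command list can be passed to subprocess.run(cmd, check=True) once the query file, GI list, and database are in place. Reading the list with read_gi_list first is a cheap way to confirm that the download contains only numeric identifiers before starting a long search.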
The FASTA programs provide multiple options for searching
subsets of larger sequence databases. Unlike BLAST, which only
searches one database format (BLAST makeblastdb format), the
FASTA programs can search sequences in BLAST, FASTA, and
several other less widely used formats. In addition, FASTA can
search protein sequences stored in a MySQL or PostgreSQL relational
database, or subsets of sequence databases defined by lists of
GI numbers or accession identifiers. With the FASTA programs, the
format of the sequence database is specified as part of its name; by
default a FASTA format database is searched (Table 4). The FASTA
version of the gilist option (format 10, library subset list) works
both with NCBI sequence databases (which have a gi number) and
with EMBL-EBI and UniProt databases, which use sequence
identifiers or accessions. The FASTA library subset list can use
either numbers or strings to identify library sequence subsets.
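
The idea of selecting a library subset by either numbers or strings can be illustrated with a small Python sketch. It is a conceptual stand-in only: it does not reproduce the FASTA programs' format 10 file layout or their implementation, and the headers, identifiers, and sequences in the usage example are made up. It assumes FASTA-format headers whose first word holds '|'-delimited identifier fields.

```python
from typing import Iterable, Iterator, Set, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-format text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def header_ids(header: str) -> Set[str]:
    """Return every '|'-delimited token in the first word of a header."""
    words = header.split()
    first_word = words[0] if words else ""
    return {tok for tok in first_word.split("|") if tok}


def select_subset(lines: Iterable[str], wanted: Iterable) -> Iterator[Tuple[str, str]]:
    """Keep only records whose header carries one of the wanted identifiers.

    `wanted` may mix integers (gi-style numbers) and strings (accessions).
    """
    wanted_ids = {str(w) for w in wanted}
    for header, seq in parse_fasta(lines):
        if header_ids(header) & wanted_ids:
            yield header, seq


if __name__ == "__main__":
    # Made-up records for illustration only.
    library = [
        ">gi|111|ref|XX_000001.1| example protein A", "MKT", "AAA",
        ">sp|Q00000|EXMPL_HUMAN example protein B", "MPMILGY",
        ">gi|222|ref|XX_000002.1| example protein C", "MCC",
    ]
    for header, seq in select_subset(library, [111, "Q00000"]):
        print(header, seq)
```

Because every token is compared as a string, a numeric gi and an accession are handled by the same lookup. One consequence of this simple design is that database tags such as "gi" or "sp" would also match if they were placed in the wanted list, so the list should contain identifiers only.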
The FASTA programs can also search portions of a sequence
database by using SELECT statements on a MySQL or PostgreSQL
database. Format 16 and 17 files provide SQL SELECT statements for
getting the complete set of sequences, getting individual sequence
entries, and getting entry annotations.
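
The three kinds of SELECT statement mentioned above (the complete set of sequences, an individual entry, and an entry's annotation) can be sketched as follows. This is an illustration under stated assumptions, not the contents of a real format 16 or 17 file: the table and column names (protein, annot, acc, seq, descr) are hypothetical, the rows are made up, and Python's built-in sqlite3 module stands in for MySQL or PostgreSQL so that the example runs without a database server.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE protein (acc TEXT PRIMARY KEY, seq TEXT NOT NULL);
CREATE TABLE annot (acc TEXT REFERENCES protein(acc), descr TEXT);
"""

# One statement per task named in the text.
SELECT_ALL = "SELECT acc, seq FROM protein ORDER BY acc"
SELECT_ONE = "SELECT acc, seq FROM protein WHERE acc = ?"
SELECT_ANNOT = "SELECT descr FROM annot WHERE acc = ?"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding two made-up protein entries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("EX0001", "MKTAAA"), ("EX0002", "MPMILGY")],
    )
    conn.executemany(
        "INSERT INTO annot VALUES (?, ?)",
        [("EX0001", "example protein A"), ("EX0002", "example protein B")],
    )
    conn.commit()
    return conn


def fetch_all(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Get the complete set of sequences."""
    return conn.execute(SELECT_ALL).fetchall()


def fetch_one(conn: sqlite3.Connection, acc: str) -> Optional[Tuple[str, str]]:
    """Get one sequence entry by accession, or None if it is absent."""
    return conn.execute(SELECT_ONE, (acc,)).fetchone()


def fetch_annotation(conn: sqlite3.Connection, acc: str) -> Optional[str]:
    """Get the annotation text for one entry, or None if it is absent."""
    row = conn.execute(SELECT_ANNOT, (acc,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_demo_db()
    print(fetch_all(conn))
    print(fetch_one(conn, "EX0002"))
    print(fetch_annotation(conn, "EX0001"))
```

Searching a portion of the database then amounts to adding a WHERE clause to the first statement, for example one that restricts the rows to a single organism, provided the schema records that information.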