Biology Reference
In-Depth Information
can be very redundant (the largest protein families in refseq_protein
have more than 30,000 members). Thus, searching representative
portions of sequence databases, or focussing on
taxonomically close sequences, can be far more efficient than
searching complete datasets. Both the BLAST and FASTA pro-
grams provide options for searching database subsets. The BLAST
programs accept the -gilist option, which names a file listing the
gi numbers to be searched within a much larger database. This option
requires that the database being searched contains gi numbers (all
the databases available from the NCBI do, but databases from the
EMBL-EBI and UniProt do not). The -gilist option provides a
powerful tool for searching sequences from selected representative
organisms, as the NCBI Entrez site makes it very easy to download
a list of sequences from an organism, or the results of an Entrez
query. For example, searching Entrez protein sequences with the
query:
srcdb_refseq[prop] AND txid9606[orgn]
provides a list of 35,615 refseq protein sequences from human
( txid9606 ). The sequences can be downloaded in FASTA format
from the search result page, but it is much easier to simply download
the GI List to a file, and then use blastp or blastx with the
-gilist option. The GI List download is available after any NCBI/Entrez
query into protein and DNA databases; one can just as easily download a
GI list for all proteins with the term “glutathione transferase” in
their name.
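As a minimal sketch of the workflow just described, the Python snippet below assembles a blastp command line that restricts the search to the gi numbers saved from Entrez. The file names (query.fa, human_refseq.gi, hits.txt) and the helper function are hypothetical; -query, -db, -gilist, and -out are standard BLAST+ options, and actually running the command assumes a local refseq_protein database that carries gi numbers.

```python
import shlex


def blastp_gilist_command(query, db, gilist, out):
    """Return the argument list for a blastp search limited to a GI list.

    All four arguments are file or database names chosen by the caller;
    the names used below are placeholders for illustration only.
    """
    return ["blastp", "-query", query, "-db", db, "-gilist", gilist, "-out", out]


cmd = blastp_gilist_command(
    "query.fa",          # hypothetical query sequence file
    "refseq_protein",    # database named in the text
    "human_refseq.gi",   # hypothetical GI List file saved from Entrez
    "hits.txt",          # hypothetical output file
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To execute it (requires BLAST+ and the database to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a file name contains spaces.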
The FASTA programs provide multiple options for searching
subsets of larger sequence databases. Unlike BLAST, which only
searches one database format (BLAST makeblastdb format), the
FASTA programs can search sequences in BLAST, FASTA, and
several other less widely used formats. In addition, FASTA can
search protein sequences stored in a MySQL or PostgreSQL rela-
tional database, or subsets of sequence databases defined by lists of
GI numbers or accession identifiers. With the FASTA programs, the
format of the sequence database is specified as part of its name; by
default a FASTA format database is searched (Table 4 ). The FASTA
version of the gilist option (format 10, library subset list) works
both with NCBI sequence databases (which have a gi number) and
with EMBL-EBI and UniProt databases, which use sequence
identifiers or accessions. The FASTA library subset list can use
either numbers or strings to identify library sequence subsets.
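Conceptually, a library subset list restricts the search to the entries whose identifiers appear in the list. The Python sketch below illustrates that idea on an in-memory FASTA-format string; it is not the FASTA programs' implementation, nor the syntax of a format 10 file, and the function name, header layout, identifiers, and sequences are assumptions made up for the example.

```python
def subset_fasta(fasta_text, wanted_ids):
    """Yield (identifier, sequence) for records whose identifier is listed.

    The identifier is taken to be the first whitespace-delimited token
    after '>' on each header line (an assumption for this sketch).
    """
    wanted = {str(i) for i in wanted_ids}  # accept numbers or strings
    ident, chunks = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Emit the previous record if it was requested.
            if ident in wanted:
                yield ident, "".join(chunks)
            ident, chunks = line[1:].split()[0], []
        elif line.strip():
            chunks.append(line.strip())
    # Emit the final record if it was requested.
    if ident in wanted:
        yield ident, "".join(chunks)


# Made-up library with one numeric and two string identifiers.
library = """>ACC_A example entry A
MPMILGYW
DIRGL
>12345 example entry B
MSCESSMV
>ACC_C example entry C
MSMTLGYW
"""

selected = list(subset_fasta(library, ["ACC_A", 12345]))
print(selected)
```

Converting every requested identifier to a string lets the same list mix gi-style numbers and accession-style strings.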
The FASTA programs can also search portions of a sequence
database by using SELECT statements on a MySQL or PostgreSQL
database. Format 16 and 17 files provide SQL select statements for
getting the complete set of sequences, getting individual sequence
entries, and getting entry annotations.
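The three kinds of SELECT statement mentioned above can be sketched with Python's built-in sqlite3 module standing in for MySQL or PostgreSQL. The table name (protein), its columns (acc, descr, seq), and the sample rows are hypothetical, and the snippet does not reproduce the layout of a FASTA format 16 or 17 file; it only shows the sort of queries such a file would need to supply.

```python
import sqlite3

# In-memory database used as a stand-in for a MySQL/PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (acc TEXT PRIMARY KEY, descr TEXT, seq TEXT)")
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [
        ("ACC_A", "example entry A", "MPMILGYW"),  # made-up rows
        ("ACC_B", "example entry B", "MSCESSMV"),
    ],
)

# 1. Get the complete set of sequences to be searched.
all_seqs = conn.execute("SELECT acc, seq FROM protein ORDER BY acc").fetchall()

# 2. Get an individual sequence entry by its identifier.
one_seq = conn.execute(
    "SELECT seq FROM protein WHERE acc = ?", ("ACC_B",)
).fetchone()[0]

# 3. Get the annotation (description) for an entry.
one_descr = conn.execute(
    "SELECT descr FROM protein WHERE acc = ?", ("ACC_A",)
).fetchone()[0]

print(len(all_seqs), one_seq, one_descr)
conn.close()
```

Adding a WHERE clause to the first query (for example, on a taxonomy column, if the schema had one) is how such a database would restrict the search to a subset.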