BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods

can be very redundant (the largest protein families in refseq_protein have more than 30,000 members). Thus, searching representative portions of sequence databases, or focussing on taxonomically close sequences, can be far more efficient than searching complete datasets. Both the BLAST and FASTA programs provide options for searching database subsets. The BLAST programs accept the -gilist option, which restricts the search to a specified list of gi numbers within a much larger database. This option requires that the database being searched contains gi numbers (all the databases available from the NCBI do, but databases from the EMBL-EBI and UniProt do not). The -gilist option provides a powerful tool for searching sequences from selected representative organisms, as the NCBI Entrez site makes it very easy to download a list of sequences from an organism, or the results of an Entrez query. For example, searching Entrez protein sequences with the query:

srcdb_refseq[prop] AND txid9606[orgn]

provides a list of 35,615 refseq protein sequences from human (txid9606). The sequences can be downloaded in FASTA format from the search result page, but it is much easier to simply download the GI List to a file, and then use blastp or blastx with the -gilist option. The GI List option is available after any NCBI/Entrez query into protein and DNA databases; one can just as easily search for all proteins with the term “glutathione transferase” in their name.
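As a rough illustration of the workflow described above, the following Python sketch assembles a blastp command line that limits the search to a downloaded GI list. The file names (query.fasta, human_refseq.gi, hits.txt) and the helper build_blastp_command are hypothetical, chosen only for this example; the flags -query, -db, -gilist, and -out are those of the BLAST+ command-line programs. The command is only built and printed here, not executed.

```python
import shlex


def build_blastp_command(query_path, database, gilist_path, out_path):
    """Return the argument list for a blastp search limited to a GI list.

    All paths are placeholders supplied by the caller (hypothetical names).
    """
    return [
        "blastp",
        "-query", query_path,    # FASTA file holding the query sequence
        "-db", database,         # BLAST database that contains gi numbers
        "-gilist", gilist_path,  # text file with one gi number per line
        "-out", out_path,        # file that receives the search report
    ]


command = build_blastp_command(
    "query.fasta", "refseq_protein", "human_refseq.gi", "hits.txt"
)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

On a system with BLAST+ installed and a local database that carries gi numbers, the same list could be passed to subprocess.run(command, check=True).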

The FASTA programs provide multiple options for searching subsets of larger sequence databases. Unlike BLAST, which only searches one database format (BLAST makeblastdb format), the FASTA programs can search sequences in BLAST, FASTA, and several other less widely used formats. In addition, FASTA can search protein sequences stored in a MySQL or PostgreSQL relational database, or subsets of sequence databases defined by lists of GI numbers or accession identifiers. With the FASTA programs, the format of the sequence database is specified as part of its name; by default a FASTA format database is searched (Table 4). The FASTA version of the gilist option (format 10, library subset list) works both with NCBI sequence databases (which have a gi number) and with EMBL-EBI and UniProt databases, which use sequence identifiers or accessions. The FASTA library subset list can use either numbers or strings to identify library sequence subsets.
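The idea of a library subset defined by a list of identifiers can be sketched in a few lines of Python. This is only a conceptual illustration: it is not the FASTA programs' implementation, and it does not reproduce the syntax of a format 10 file. The identifiers (ACC0001, 1000001, ACC0002), the sequences, and the function names are all made up for the example. It shows that the same subset list can mix numeric and string identifiers.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-format lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """Return the first whitespace-delimited token of a header line."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def select_subset(records, wanted_ids):
    """Keep only the records whose identifier appears in wanted_ids."""
    wanted = {str(identifier) for identifier in wanted_ids}
    return [(h, s) for h, s in records if record_id(h) in wanted]


# Hypothetical three-entry library; identifiers and sequences are invented.
library = """>ACC0001 example protein A
MKTAYIAK
QRQISFVK
>1000001 example protein B
MSDNELLK
>ACC0002 example protein C
MGSSHHHH
""".splitlines()

# The subset list mixes a string identifier and a numeric one.
subset = select_subset(parse_fasta(library), ["ACC0001", 1000001])
for header, sequence in subset:
    print(record_id(header), len(sequence))
```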

The FASTA programs can also search portions of a sequence database by using SELECT statements on a MySQL or PostgreSQL database. Format 16 and 17 files provide SQL SELECT statements for getting the complete set of sequences, getting individual sequence entries, and getting entry annotations.
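The three kinds of statement mentioned above might look roughly like the following sketch. It rests on several assumptions: the table name protein and its columns (acc, taxon_id, descr, seq) are hypothetical, the rows are invented, and Python's built-in sqlite3 module stands in for MySQL or PostgreSQL so that the example is self-contained. It does not reproduce the layout of the FASTA format 16 or 17 files themselves.

```python
import sqlite3

# In-memory database with a hypothetical schema (not the one FASTA expects).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    "acc TEXT PRIMARY KEY, taxon_id INTEGER, descr TEXT, seq TEXT)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("ACC0001", 9606, "example protein A", "MKTAYIAK"),
        ("ACC0002", 10090, "example protein B", "MSDNELLK"),
        ("ACC0003", 9606, "example protein C", "MGSSHHHH"),
    ],
)

# 1. The set of sequences to search, here restricted to one taxon.
SELECT_SET = "SELECT acc, seq FROM protein WHERE taxon_id = ? ORDER BY acc"
# 2. A single sequence entry, looked up by identifier.
SELECT_ENTRY = "SELECT seq FROM protein WHERE acc = ?"
# 3. The annotation (description) for one entry.
SELECT_ANNOTATION = "SELECT descr FROM protein WHERE acc = ?"

human_set = conn.execute(SELECT_SET, (9606,)).fetchall()
entry = conn.execute(SELECT_ENTRY, ("ACC0001",)).fetchone()
annotation = conn.execute(SELECT_ANNOTATION, ("ACC0001",)).fetchone()

print(human_set)
print(entry[0], annotation[0])
```

Restricting the first statement with a WHERE clause is what turns a full database into a searchable subset; the other two statements only retrieve details for entries that have already been selected.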
