BLAST and FASTA Similarity Searching for Multiple Sequence Alignment - Multiple Sequence Alignment Methods


in a cloud computing environment [11]. Running a local copy of the BLAST or FASTA programs gives the researcher some control over the time required for the analysis, ensures that searches are reproducible (the versions of the program and reference database remain constant), and allows searches to be performed against the database most appropriate to the research question. Moreover, web implementations of the search programs may impose output constraints that can be removed in local implementations. The NCBI BLAST programs can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST . The FASTA programs can be downloaded from http://faculty.virginia.edu/wrpearson/fasta or ftp://ftp.ebi.ac.uk/pub/software/unix/fasta .
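As a minimal sketch of the reproducibility point above, a local search can be wrapped in a short Python script that stores the program version and database name next to the results. The flags used here (-query, -db, -out, -evalue, -outfmt, -version) belong to the NCBI BLAST+ command-line programs; the helper names, the provenance file, the default E-value threshold, and the file and database names in the usage comment are illustrative assumptions, not part of the protocol described in this chapter.

```python
import json
import subprocess


def build_blastp_command(query, db, out, evalue=1e-3):
    """Return the argument list for a tabular (-outfmt 6) blastp search."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-outfmt", "6",
    ]


def run_search(query, db, out, evalue=1e-3):
    """Run blastp and write a small provenance record next to the results.

    Requires a local BLAST+ installation and an already formatted database.
    """
    # Record the exact program version so the search can be repeated later.
    version = subprocess.run(
        ["blastp", "-version"], capture_output=True, text=True, check=True
    ).stdout.strip()
    command = build_blastp_command(query, db, out, evalue)
    subprocess.run(command, check=True)
    record = {"program": version, "database": db, "command": command}
    with open(out + ".provenance.json", "w") as handle:
        json.dump(record, handle, indent=2)
    return command


# Example call (hypothetical file and database names):
# run_search("query.fasta", "swissprot", "query_vs_swissprot.tsv")
```

Keeping the command as a list, rather than a shell string, avoids quoting problems with unusual file names and makes the stored record an exact copy of what was run.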

Sequence databases: The biggest challenge in running the programs locally is keeping the sequence databases current. While the programs change slowly, protein and DNA sequence databases change daily or weekly. Even when the download process is scripted and runs automatically, downloading and reformatting are time-consuming and can fail unexpectedly. The scripting expertise required to keep sequence comparison programs and databases up to date may be better used to build scripted interfaces to the NCBI and EMBL-EBI web resources.
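A scripted interface of the kind suggested above might start from something like the following sketch. It assumes the NCBI BLAST URL API at Blast.cgi, where a search is submitted with CMD=Put and retrieved with CMD=Get using the request ID (RID) returned by the server; the helper names, the default database, and the expected layout of the "RID = ..." line are assumptions to be checked against the current NCBI documentation and usage guidelines before automating any requests. The sketch only builds requests and parses a reply, and sends nothing over the network.

```python
import re
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_request(sequence, program="blastp", database="swissprot"):
    """Return (url, encoded_body) for submitting a search with CMD=Put."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    }
    return BLAST_URL, urlencode(params).encode("ascii")


def parse_request_id(response_text):
    """Extract the request ID from a line such as 'RID = ABC123', or None."""
    match = re.search(r"^\s*RID\s*=\s*(\S+)", response_text, re.MULTILINE)
    return match.group(1) if match else None


def build_result_url(rid, format_type="Text"):
    """Return the URL for retrieving the results of a submitted search."""
    query = urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})
    return BLAST_URL + "?" + query
```

The body returned by build_submit_request could be posted with urllib.request.urlopen(url, data=body); a real client would also need to wait between polling requests and handle failed or expired searches.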

Comprehensive protein and DNA sequence databases can be downloaded from the NCBI ( ftp://ftp.ncbi.nlm.nih.gov/blast/db/ ) and the EMBL-EBI ( ftp://ftp.ebi.ac.uk/pub/databases/ ). Selection of the appropriate database for similarity searching is discussed below (Subheading 4.1), but the most sensitive and efficient searches are performed against protein databases, which are relatively compact (<20 GB for the largest protein sets). DNA sequence datasets are many orders of magnitude larger and highly redundant; except in rare cases, searches should be performed against protein sets or against selected DNA subsets.
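Because database downloads are large and can fail partway through, a scripted download usually benefits from an integrity check before the database is put into use. The sketch below assumes that each downloaded archive is accompanied by a checksum file in the common "<hex digest>  <file name>" layout; the function names and file names are hypothetical, and the layout should be confirmed against the files actually provided by the download site.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_text):
    """Return the digest from text of the form '<hex digest>  <file name>'."""
    return md5_text.split()[0].lower()


def archive_is_intact(archive_path, md5_path):
    """Return True if the archive matches the digest in its checksum file."""
    with open(md5_path) as handle:
        return md5_of_file(archive_path) == expected_md5(handle.read())


# Example call (hypothetical file names):
# archive_is_intact("swissprot.tar.gz", "swissprot.tar.gz.md5")
```

Reading in chunks keeps memory use constant, which matters for archives that run to many gigabytes.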

Similarity searching on the “Cloud”: Recently, comprehensive sets of bioinformatics programs, including BLAST and FASTA, have been packaged as instances for the Amazon Web Services cloud computing environment [11]. This packaging makes it easier to set up, at low cost, the computing infrastructure needed for a large-scale analysis project, because programs from diverse sources have already been collected, installed, and tested. The Cloud BioLinux environment also provides access to many model organism genomes, but the focus seems to be on DNA read mapping; few protein sequence databases are available within the Amazon Web Services environment. Using the Bio-Linux environment is more convenient than downloading and installing dozens of bioinformatics programs, but access to current protein sequence databases is much easier using the Web search interfaces at the NCBI and EMBL-EBI.
