in a cloud computing environment [ 11 ]. Running a local copy of
the BLAST or FASTA programs provides the researcher with some
control over the time required for the analysis, ensures that
the searches are reproducible (the version of the program and
reference database will remain constant), and allows searches to
be performed against the most appropriate database for the
research question being addressed. Moreover, web implementations
of the search programs may impose output constraints that
can be removed in local implementations. The NCBI BLAST programs
can be downloaded from
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST .
The FASTA programs can be downloaded from
http://faculty.virginia.edu/wrpearson/fasta or
ftp://ftp.ebi.ac.uk/pub/software/unix/fasta .
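The reproducibility point above can be illustrated with a minimal sketch that assembles a local blastp command line and records it together with the program version and database release. It assumes the NCBI BLAST+ command-line programs are installed; the helper names, file names, and release label are hypothetical and chosen only for illustration. The sketch builds the command without executing it.

```python
import shlex


def build_blastp_command(query_fasta, db_prefix, out_path, evalue=1e-3, threads=4):
    """Return the argument list for a local blastp search (nothing is executed)."""
    return [
        "blastp",
        "-query", query_fasta,         # FASTA file with the query sequence(s)
        "-db", db_prefix,              # path prefix of a formatted BLAST database
        "-out", out_path,              # where the results are written
        "-outfmt", "6",                # tabular output, one hit per line
        "-evalue", str(evalue),        # expectation-value threshold
        "-num_threads", str(threads),
    ]


def provenance_record(command, program_version, db_release):
    """Bundle the details needed to repeat the same search later."""
    return {
        "command": shlex.join(command),
        "program_version": program_version,  # e.g. the text printed by `blastp -version`
        "db_release": db_release,            # hypothetical label for the local database copy
    }
```

The returned list could be passed to `subprocess.run(command, check=True)` to run the search, and the record stored next to the output file so that the program version and database used are never in doubt.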
Sequence databases —The biggest challenge in running the programs
locally is keeping the sequence databases current. While the
programs change slowly, protein and DNA sequence databases
change daily or weekly. Even when the download process is scripted
and runs automatically, the downloading and reformatting process
is time-consuming and can fail unexpectedly. The scripting expertise
required to keep sequence comparison programs and databases
up-to-date may be better used to build scripted interfaces to the NCBI
and EMBL-EBI web resources.
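Because scripted downloads can fail unexpectedly, an update script usually needs retries and some protection against leaving a half-written database in place. The following is a minimal sketch of that idea, not a complete update pipeline: the function name, the `.part` suffix, and the `fetch` callable are all assumptions made for illustration.

```python
import os
import time


def fetch_with_retries(fetch, dest_path, attempts=3, delay_seconds=0.0):
    """Call fetch(tmp_path) until it succeeds, then move the file into place.

    Returns the number of the attempt that succeeded.
    """
    tmp_path = dest_path + ".part"
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            fetch(tmp_path)
            # Replace the old copy only after a complete download.
            os.replace(tmp_path, dest_path)
            return attempt
        except OSError as error:
            last_error = error
            # Discard any partial file before trying again.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

In practice `fetch` might wrap a call such as `urllib.request.urlretrieve(url, tmp_path)`, whose network errors are subclasses of `OSError`. The previous database stays usable until a new copy has been downloaded in full.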
Comprehensive protein and DNA sequence databases can be
downloaded from the NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/)
and the EMBL-EBI (ftp://ftp.ebi.ac.uk/pub/databases/).
Selection of the appropriate database for similarity searching is discussed
below (Subheading 4.1), but the most sensitive and efficient
searches are performed against protein databases, which are relatively
compact (< 20 GB for the largest protein sets). DNA sequence
datasets are many orders of magnitude larger and highly redundant;
except in rare cases, searches should be performed against protein sets
or against selected DNA subsets.
Similarity searching on the “Cloud” —Recently, comprehensive sets
of bioinformatics programs, including BLAST and FASTA, have
been packaged as instances for the Amazon Web Services cloud
computing environment [ 11 ]. This packaging makes it easier to
cheaply set up the computing infrastructure necessary for a large-scale
analysis project, because the programs have already been collected
from diverse sources, installed, and tested. The Cloud BioLinux environment
also provides access to many model organism genomes, but the
focus seems to be on DNA read mapping; few protein sequence
databases are available within the Amazon Web Services environment.
Using the Cloud BioLinux environment is more convenient than
downloading and installing dozens of bioinformatics programs, but
access to current protein sequence databases is much easier using
the Web search interfaces at the NCBI and EMBL-EBI.