Table 4
FASTA sequence file formats

Format                                                Number
FASTA (>SEQID - comment/sequence)                     0
Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)        1
EMBL/Uniprot (ID/DE/SQ)                               3
GCG (version 8.0) Unix Protein and DNA (compressed)   6
FASTQ (sequence only, quality ignored)                7
Library subset list                                   10
NCBI Blast (makeblastdb format)                       12
MySQL (requires special compilation)                  16
Postgres (requires special compilation)               17
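The format numbers in Table 4 can be kept in a small lookup table. The sketch below assumes the FASTA programs' convention of giving the number after the library file name within the same argument (for example "swissprot.fa 0"); that convention is not stated in this section, and the dictionary keys and the `library_spec` helper are hypothetical names chosen for illustration.

```python
# Format numbers copied from Table 4; the short keys are illustrative names.
LIBRARY_FORMATS = {
    "fasta": 0,
    "genbank": 1,
    "embl": 3,
    "gcg": 6,
    "fastq": 7,
    "subset_list": 10,
    "blast": 12,
    "mysql": 16,
    "postgres": 17,
}


def library_spec(path: str, fmt: str) -> str:
    """Return 'path N', where N is the Table 4 number for fmt.

    Assumption: the library argument carries the format number after the
    file name, separated by a space.
    """
    try:
        number = LIBRARY_FORMATS[fmt]
    except KeyError:
        raise ValueError(f"unknown library format: {fmt!r}") from None
    return f"{path} {number}"
```

For example, `library_spec("nr", "blast")` yields the string `nr 12`, which would then be passed as a single quoted argument on the command line.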
Because fasta estimates statistical parameters from the unrelated sequences in the database, fasta DNA:DNA expectation values are probably more accurate than blastn values, but both DNA:DNA expectation values are far less reliable than protein:protein and translated-DNA:protein estimates. For protein:protein and translated-DNA:protein searches, expectation (E()) values < 10^-3 provide strong evidence for homology (one non-homolog in 1,000 searches). For DNA:DNA searches, E()-values > 10^-10 are suspect.
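The two cutoffs can be written as a small filter. This is a minimal sketch, not part of either program: it reads the DNA:DNA rule as "only E() values at or below 10^-10 are trusted", which is an interpretation of the text, and the function name and search-type labels are hypothetical.

```python
# Cutoffs quoted in the paragraph above.
PROTEIN_EVALUE_CUTOFF = 1e-3   # protein:protein and translated-DNA:protein
DNA_EVALUE_CUTOFF = 1e-10      # DNA:DNA (assumed reading: larger values are suspect)


def supports_homology(evalue: float, search_type: str) -> bool:
    """Return True if an E() value clears the cutoff for its search type."""
    if evalue < 0:
        raise ValueError("E() values cannot be negative")
    if search_type in ("protein", "translated"):
        return evalue < PROTEIN_EVALUE_CUTOFF
    if search_type == "dna":
        # Anything above the cutoff is treated as suspect.
        return evalue <= DNA_EVALUE_CUTOFF
    raise ValueError(f"unknown search type: {search_type!r}")
```

Under these assumptions, an E() value of 1e-5 passes for a protein:protein search but not for a DNA:DNA search.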
Library formats. The BLAST programs can either compare two sequences in FASTA format, or search databases formatted with the makeblastdb command, which converts a FASTA or ASN.1 format file into a set of indexed binary sequence files that can be searched very efficiently. As noted above, current versions of comprehensive protein and DNA databases that are pre-formatted for BLAST searches are available from the NCBI (ftp.ncbi.nlm.nih.gov/blast/db), and subsets of these databases can be constructed using the -gilist option. But relying only on files downloaded from the NCBI confines searches to the NCBI ecosystem; some bioinformatics resources, such as the Pfam database and Gene Ontology links, are more easily accessed using UniProtKB/Swiss-Prot proteins and accessions, so investigators running local copies of the BLAST programs will need to run makeblastdb. Likewise, makeblastdb is required when searching local sequence collections.
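As an illustration of formatting a local collection, the sketch below assembles a makeblastdb command line in Python without running it. The -in, -dbtype, -out, and -parse_seqids options are standard BLAST+ options; the file name uniprot_sprot.fasta, the database name swissprot, and the helper function itself are assumptions made for this example.

```python
import shlex


def makeblastdb_command(fasta_path, db_name, dbtype="prot", parse_seqids=True):
    """Build (but do not run) a makeblastdb argument list."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]
    if parse_seqids:
        # Keep sequence identifiers so entries can be retrieved by accession.
        cmd.append("-parse_seqids")
    return cmd


# Hypothetical input file and database name.
cmd = makeblastdb_command("uniprot_sprot.fasta", "swissprot")
print(shlex.join(cmd))
```

When BLAST+ is installed, the returned list can be handed to `subprocess.run(cmd, check=True)`; building the list separately keeps the example runnable on machines without the tool.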
The FASTA programs can read query sequences and library databases in popular formats, including FASTA, makeblastdb (BLAST), and FASTQ (Table 4). In addition, the FASTA programs can read databases composed of multiple files in different sequence formats. In addition to conventional “flat-file” (FASTA, GenBank, EMBL/Uniprot) and binary makeblastdb/BLAST formats, the