Table 4
FASTA sequence file formats

Format                                                  Number
FASTA (>SEQID - comment/sequence)                       0
Uncompressed GenBank (LOCUS/DEFINITION/ORIGIN)          1
EMBL/UniProt (ID/DE/SQ)                                 3
GCG (version 8.0) Unix Protein and DNA (compressed)     6
FASTQ (sequence only, quality ignored)                  7
Library subset list                                     10
NCBI BLAST (makeblastdb format)                         12
MySQL (requires special compilation)                    16
Postgres (requires special compilation)                 17
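
As an illustration of the simplest entry in Table 4 (format 0, a ">" header line carrying the sequence identifier and an optional comment, followed by sequence lines), the following is a minimal Python sketch of a FASTA reader. The function name parse_fasta and the sample records are hypothetical, and this is not how the FASTA programs themselves parse libraries.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, comment, sequence) for each record in FASTA text.

    Hypothetical helper for illustration only.
    """
    seq_id = None
    comment = ""
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, comment, "".join(chunks)
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            comment = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif seq_id is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, comment, "".join(chunks)


if __name__ == "__main__":
    sample = ">seq1 first example\nACGT\nTTGA\n>seq2\nMKV\n"
    for record in parse_fasta(sample):
        print(record)
```

Sequence lines that appear before any header are ignored here; a stricter reader might raise an error instead.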
Because fasta estimates statistical parameters from the
unrelated sequences in the database, fasta DNA:DNA expectation
values are probably more accurate than blastn values, but
both DNA:DNA expectation values are far less reliable than
protein:protein and translated-DNA:protein estimates. For
protein:protein and translated-DNA:protein searches, expectation
(E()) values < $10^{-3}$ provide strong evidence for homology (one
non-homolog in 1,000 searches). For DNA:DNA searches,
E()-values > $10^{-10}$ are suspect.
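
The arithmetic behind "one non-homolog in 1,000 searches" can be sketched in a few lines of Python. The sketch reads the thresholds in the paragraph above as E() < 10^-3 for protein:protein and translated-DNA:protein searches, and E() > 10^-10 as suspect for DNA:DNA searches. The function names, the search-type labels, and the returned strings are all hypothetical; the rules say nothing about E()-values on the other side of each threshold, so the sketch returns a neutral label there.

```python
PROTEIN_THRESHOLD = 1e-3   # protein:protein and translated-DNA:protein
DNA_THRESHOLD = 1e-10      # DNA:DNA


def expected_false_positives(e_threshold: float, n_searches: int) -> float:
    """Expected number of unrelated top hits across independent searches.

    An E()-value is the number of times a score at least this good is
    expected by chance in one search, so expectations add across searches.
    """
    return e_threshold * n_searches


def interpret_evalue(evalue: float, search_type: str) -> str:
    """Apply the rules of thumb above; labels are illustrative only."""
    if search_type in ("protein", "translated"):
        if evalue < PROTEIN_THRESHOLD:
            return "strong evidence for homology"
        return "no conclusion from this rule"
    if search_type == "dna":
        if evalue > DNA_THRESHOLD:
            return "suspect"
        return "not flagged by this rule"
    raise ValueError(f"unknown search type: {search_type!r}")


if __name__ == "__main__":
    print(expected_false_positives(PROTEIN_THRESHOLD, 1000))
    print(interpret_evalue(1e-6, "protein"))
    print(interpret_evalue(1e-6, "dna"))
```

The same E()-value of 10^-6 is therefore read differently depending on the search type, which is the point of the paragraph above.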
Library formats: The BLAST programs can either compare two
sequences in FASTA format or search databases formatted with the
makeblastdb command, which converts a FASTA or ASN.1 format
file into a set of indexed binary sequence files that can be
searched very efficiently. As noted above, current versions of
comprehensive protein and DNA databases that are pre-formatted for
BLAST searches are available from the NCBI
(ftp.ncbi.nlm.nih.gov/blast/db), and subsets of these databases
can be constructed using the gilist option. But restricting
searches to files that can be downloaded from the NCBI confines
them to the NCBI ecosystem; some bioinformatics resources, such as
the Pfam database and Gene Ontology links, are more easily
accessed using UniProtKB/Swiss-Prot proteins and accessions, so
investigators running local copies of the BLAST programs will need
to run makeblastdb on those sequence sets. Likewise, makeblastdb
is required when searching local sequence collections.
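
A minimal Python sketch of preparing a local FASTA file for BLAST searches follows. It only assembles the makeblastdb argument list, using the BLAST+ options -in, -dbtype, -out, -parse_seqids, and -title. The helper name makeblastdb_command and the file and database names are hypothetical, and the command would have to be run separately on a machine with BLAST+ installed.

```python
from typing import List, Optional


def makeblastdb_command(
    fasta_path: str,
    dbtype: str,
    out_name: str,
    title: Optional[str] = None,
) -> List[str]:
    """Build an argument list for makeblastdb (hypothetical helper).

    dbtype must be "prot" for protein or "nucl" for nucleotide input.
    """
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    cmd = [
        "makeblastdb",
        "-in", fasta_path,     # input FASTA file
        "-dbtype", dbtype,     # molecule type of the input
        "-out", out_name,      # base name of the indexed database files
        "-parse_seqids",       # keep sequence identifiers searchable
    ]
    if title:
        cmd += ["-title", title]
    return cmd


if __name__ == "__main__":
    # File and database names below are placeholders.
    cmd = makeblastdb_command("uniprot_sprot.fasta", "prot", "swissprot")
    print(" ".join(cmd))
    # To actually build the database (requires BLAST+ on the PATH):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the list without executing it keeps the sketch usable where BLAST+ is not installed, and passing a list rather than a shell string avoids quoting problems with unusual file names.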
The FASTA programs can read query and library databases in
popular formats, including FASTA, makeblastdb (BLAST), and
FASTQ (Table 4). In addition, the FASTA programs can
read databases composed of multiple files in different sequence
formats. In addition to conventional “flat-file” (FASTA, GenBank,
EMBL/UniProt) and binary makeblastdb/BLAST formats, the