GenBank: The NCBI Nucleotide Sequence Database - Essays in Bioinformatics

them; therefore, they lack feature annotation. STSs [http://www.ncbi.nlm.nih.gov/dbSTS/]

are short genomic landmark sequences (1). They are operationally unique in that they are

specifically amplified from the genome by PCR. In addition, they define a

specific location on the genome and are, therefore, useful for mapping.

GSSs [http://www.ncbi.nlm.nih.gov/dbGSS/] are also short sequences but are derived from

genomic DNA, about which little is known. They include, but are not limited to, single-pass

GSSs, BAC ends, exon-trapped genomic sequences, and AluPCR sequences. EST, STS,

and GSS sequences reside in their respective divisions within GenBank, rather than in the

taxonomic division of the organism. The sequences are maintained within GenBank in the

dbEST, dbSTS, and dbGSS databases.

11. Submitting Data to dbEST, dbSTS, or dbGSS

Because of the large numbers of sequences that are submitted at once, dbEST, dbSTS, and

dbGSS entries are stored in relational databases where information that is common to all

sequences can be shared. Submissions consist of several files containing the common

information, plus a file of the sequences themselves. The three types of submissions have

different requirements, but all include a Publication file and a Contact file. See the dbEST

[http://www.ncbi.nlm.nih.gov/dbEST/], dbSTS [http://www.ncbi.nlm.nih.gov/dbSTS/], and

dbGSS [http://www.ncbi.nlm.nih.gov/dbGSS/] pages for the specific requirements for each

type of submission. In general, users generate the appropriate files for the submission type

and then email the files to batch-sub@ncbi.nlm.nih.gov. If the files are too big for email,

they can be deposited into an FTP account. Upon receipt, the files are examined by a

GenBank annotator, who fixes any errors when possible or contacts the submitter to request

corrected files. Once the files are satisfactory, they are loaded into the appropriate database

and assigned Accession numbers. At this step, the data-loading software may detect

additional formatting errors, such as double quotes anywhere in the file or invalid

characters in the sequences. Again, if the annotator cannot fix the errors, a request for a

corrected submission is sent to the user. After all problems are resolved, the entries are

loaded into GenBank.
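
The two formatting problems named above (double quotes anywhere in a file, invalid characters in the sequences) can be checked before a submission is sent. The following is a minimal sketch, not the checks that the GenBank data-loading software actually performs. It assumes that the allowed sequence alphabet is the set of IUPAC nucleotide codes, and the function names find_double_quotes and find_invalid_bases are hypothetical.

```python
# Assumption: the IUPAC nucleotide codes are the allowed sequence characters.
# The real loader's rules are not given in the text and may differ.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def find_double_quotes(text):
    """Return the 1-based numbers of the lines that contain a double quote."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if '"' in line
    ]


def find_invalid_bases(sequence):
    """Return (position, character) pairs for characters outside the alphabet.

    Positions are 1-based and count every character, including whitespace;
    whitespace itself is never reported as invalid.
    """
    return [
        (position, char)
        for position, char in enumerate(sequence, start=1)
        if not char.isspace() and char.upper() not in IUPAC_NUCLEOTIDES
    ]
```

Running both checks over each file and fixing whatever they report would catch these two error classes locally; any other requirement of a given submission type still has to be checked against the dbEST, dbSTS, or dbGSS pages.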

12. Bulk Submissions: HTC and FLIC

HTC records are High-Throughput cDNA/mRNA submissions that are similar to ESTs but

often contain more information. For example, HTC entries often have a systematic gene

name (not necessarily an official gene name) that is related to the lab or center that

submitted them, and the longest open reading frame is often annotated as a coding region.

FLIC (Full-Length Insert cDNA) records contain the entire sequence of a cloned

cDNA/mRNA. Therefore, FLICs are generally longer, and sometimes even full-length,

mRNAs. They are usually annotated with genes and coding regions, although these may be

lab systematic names rather than functional names.
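
To illustrate what annotating the longest open reading frame as a coding region involves, the following sketch scans the three forward reading frames of a cDNA sequence for the longest stretch that runs from an ATG start codon to the first in-frame stop codon. It is a simplified illustration under stated assumptions, not the procedure used by any submitting center: it ignores the reverse strand, alternative start codons, and ORFs that run off the end of the sequence, and the function name longest_orf is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based and half-open, and the stop codon is included.
    Only the three forward frames are scanned (a simplifying assumption).
    """
    seq = seq.upper().replace("U", "T")  # accept mRNA written with U
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best
```

For example, longest_orf("CCATGAAATAGCC") returns (2, 11), the span covering ATG AAA TAG.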

13. HTC Submissions

HTC entries are usually generated with Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html]

or tbl2asn [http://www.ncbi.nlm.nih.gov/Sequin/table.html], and the

files are emailed to gb-sub@ncbi.nlm.nih.gov. Larger files may be submitted by

SequinMacrosend [www.ncbi.nlm.nih.gov/LargeDirSubs/dir_submit.cgi]. HTC entries
