organism, the group overseeing the sequencing effort determines the definition of finished
quality.
Phase 0, 1, and 2 records are in the HTG division of GenBank, whereas phase 3
entries go into the taxonomic division of the organism, for example, PRI (primate) for
human. An entry keeps its Accession number as it progresses from one phase to another but
receives a new Accession.Version number and a new gi number each time there is a
sequence change.
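The phase and versioning rules above can be illustrated with a minimal Python sketch. The class name, field names, and the accession "AC000001" are hypothetical and chosen only for illustration; the gi number is left out because it is assigned by NCBI and its values cannot be derived from the rules stated here.

```python
from dataclasses import dataclass

# Phases whose records live in the HTG division, per the text above.
HTG_PHASES = {0, 1, 2}


@dataclass
class Entry:
    """Hypothetical model of one HTG record (illustration only)."""
    accession: str   # fixed for the life of the record
    version: int     # increases whenever the sequence changes
    phase: int       # 0, 1, 2, or 3
    sequence: str

    @property
    def accession_version(self) -> str:
        # Accession.Version identifier, e.g. "AC000001.2"
        return f"{self.accession}.{self.version}"

    def division(self, taxonomic_division: str) -> str:
        # Phase 0-2 records stay in HTG; phase 3 moves to the
        # organism's taxonomic division (e.g. "PRI" for human).
        return "HTG" if self.phase in HTG_PHASES else taxonomic_division

    def update(self, new_phase: int, new_sequence: str) -> None:
        # The accession never changes; only a sequence change
        # triggers a new version.
        if new_sequence != self.sequence:
            self.version += 1
            self.sequence = new_sequence
        self.phase = new_phase


entry = Entry("AC000001", 1, 1, "ACGT")   # made-up accession
entry.update(2, "ACGTAA")                 # sequence changed: new version
entry.update(3, "ACGTAA")                 # phase change only: same version
```

In this sketch the entry ends as AC000001.2 in phase 3, so `entry.division("PRI")` returns the taxonomic division rather than HTG.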
6. Submitting Data to the HTG Division
To submit sequences in bulk to the HTG processing system, a center or group must set up
an FTP account by writing to htgs-admin@ncbi.nlm.nih.gov. Submitters frequently use two
tools to create HTG submissions, Sequin [http://www.ncbi.nlm.nih.gov/HTGS/sequininfo.html]
or fa2htgs [http://www.ncbi.nlm.nih.gov/HTGS/fa2htgsinfo.html]. Both of these tools
require FASTA-formatted sequence, i.e., a definition line beginning with a “greater than”
sign (“>”) followed by a unique identifier for the sequence. The raw sequence appears on
the lines after the definition line. For sequences composed of contigs separated by gaps, a
modified FASTA format [http://www.ncbi.nlm.nih.gov/HTGS/sequininfo.html] is used. In
addition, Sequin users must modify the Sequin configuration file so that the HTG genome
center features are enabled.
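As a rough illustration of the plain FASTA layout described above (a definition line starting with ">" and a unique identifier, followed by lines of raw sequence), the following Python sketch reads such text into (identifier, sequence) pairs. The function name parse_fasta and the sample identifier are hypothetical; this is not part of Sequin or fa2htgs, and it does not handle the modified FASTA format used for gapped contigs.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records = []
    seen_ids = set()
    seq_id, chunks = None, []

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record.
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            fields = line[1:].split()
            if not fields:
                raise ValueError("definition line has no identifier")
            seq_id = fields[0]
            if seq_id in seen_ids:
                raise ValueError(f"duplicate identifier: {seq_id}")
            seen_ids.add(seq_id)
            chunks = []
        elif seq_id is None:
            raise ValueError("sequence data found before a definition line")
        else:
            chunks.append(line)  # raw sequence line

    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


sample = ">clone1 example definition line\nACGTAC\nGGTTAA\n"
records = parse_fasta(sample)
```

Checking identifiers for uniqueness at parse time is a design choice of this sketch; it simply catches one class of problem before a file is handed to a submission tool.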
fa2htgs is a command-line program that is downloaded to the user's computer. The
submitter invokes a script with a series of parameters (arguments) to create a submission. It
has an advantage over Sequin in that it can be set up by the user to create submissions in
bulk from multiple files. Submissions to HTG must contain three identifiers that are used to
track each HTG record: the genome center tag, the sequence name, and the Accession
number. The genome center tag is assigned by NCBI and is generally the FTP account
login name. The sequence name is a unique identifier that is assigned by the submitter to a
particular clone or entry and must be unique within the group's submissions. When a
sequence is first submitted, it has only a sequence name and genome center tag; the
Accession number is assigned during processing. All updates to that entry must include the
center tag, sequence name, and Accession number, or processing will fail.
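The identifier rule above lends itself to a simple pre-submission check. The sketch below is a hypothetical helper, not part of the NCBI tools: a first submission needs the genome center tag and sequence name, while an update also needs the Accession number. The center tag, sequence name, and accession used in the example calls are invented.

```python
from typing import List, Optional


def missing_identifiers(center_tag: Optional[str],
                        sequence_name: Optional[str],
                        accession: Optional[str],
                        is_update: bool) -> List[str]:
    """List the tracking identifiers a submission still lacks."""
    missing = []
    if not center_tag:
        missing.append("genome center tag")
    if not sequence_name:
        missing.append("sequence name")
    # The Accession number is assigned during processing, so it is
    # only required once the entry already exists (an update).
    if is_update and not accession:
        missing.append("Accession number")
    return missing


# First submission: no Accession number exists yet, so nothing is missing.
first = missing_identifiers("examplecenter", "clone_001", None, is_update=False)

# Update without the Accession number: flagged before it is sent.
update = missing_identifiers("examplecenter", "clone_001", None, is_update=True)
```

Running a check like this locally only covers missing values; it cannot tell whether an identifier that is present is also correct.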
7. The HTG Processing Pathway
Submitters deposit HTGS sequences in the form of Seq-submit files generated by Sequin,
fa2htgs, or their own ASN.1 dumper tool into the SEQSUBMIT directory of their FTP
account. Every morning, scripts automatically pick up the files from the FTP site and copy
them to the processing [http://www.ncbi.nlm.nih.gov/HTGS/processing.html] pathway, as
well as to an archive. Once processing is complete and if there are no errors in the
submission, the files are automatically loaded into GenBank. The processing time is related
to the number of submissions that day; therefore, processing can take from one to many
hours. Entries can fail HTG processing because of three types of problems: 1. Formatting:
submissions are not in the proper Seq-submit format. 2. Identification: submissions may be
missing the genome center tag, sequence name, or Accession number, or this information is
incorrect. 3. Data: submissions have problems with the data and therefore fail the validator
checks. When submissions fail HTG processing, a GenBank annotator sends email to the
sequencing center, describing the problem and asking the center to submit a corrected entry.
Annotators do not fix incorrect submissions; this ensures that the staff of the submitting
genome center fixes the problems in their database as well. The processing pathway also