together but will be displayed in Entrez [http://www.ncbi.nlm.nih.gov/Entrez/] as single
records. Alternatively, by using the Sequin submission tool (footnote 3), a submitter can specify that
several sequences are biologically related. Such sequences are classified as environmental
sample sets, population sets, phylogenetic sets, mutation sets, or segmented sets. Each
sequence within a set is assigned its own Accession number and can be viewed
independently in Entrez. However, with the exception of segmented sets, each set is also
indexed within the PopSet division of Entrez, thus allowing scientists to view the
relationship between the sequences.

What defines a set? Environmental sample, population,
phylogenetic, and mutation sets all contain a group of sequences that spans the same gene
or region of the genome. Environmental samples are derived from a group of unclassified
or unknown organisms. A population set contains sequences from different isolates of the
same organism. A phylogenetic set contains sequences from different organisms that are
used to determine the phylogenetic relationship between them. Sequencing multiple
mutations within a single gene gives rise to a mutation set. All sets, except segmented sets,
may contain an alignment of the sequences within them and might include external
sequences already present in the database. In fact, the submitter can begin with an existing
alignment to create a submission to the database using the Sequin submission tool.
Currently, Sequin accepts FASTA+GAP, PHYLIP, MACAW, NEXUS Interleaved, and
NEXUS Contiguous alignments. Submitted alignments will be displayed in the PopSet
section of Entrez.

A segmented set is a collection of noncontiguous sequences that cover a
specified genetic region. The most common example is a set of genomic sequences
containing exons from a single gene where part or all of the intervening regions have not
been sequenced. Each member record within the set contains the appropriate annotation
(exon features, in this case). However, the mRNA and CDS will be annotated as joined
features across the individual records. Segmented sets themselves can be part of an
environmental sample, population, phylogenetic, or mutation set.
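
The rules above can be summarized in a small sketch. This is an illustrative model only, not part of Sequin or Entrez: the names SetKind, indexed_in_popset, may_contain_alignment, Segment, and joined_location are hypothetical, as are the accession numbers and coordinates. The join(...) string imitates the feature-table style of location that refers to spans on other records.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SetKind(Enum):
    """The five kinds of related-sequence sets named in the text."""
    ENVIRONMENTAL_SAMPLE = "environmental sample"
    POPULATION = "population"
    PHYLOGENETIC = "phylogenetic"
    MUTATION = "mutation"
    SEGMENTED = "segmented"


def indexed_in_popset(kind: SetKind) -> bool:
    # Every kind of set except a segmented set is indexed in PopSet.
    return kind is not SetKind.SEGMENTED


def may_contain_alignment(kind: SetKind) -> bool:
    # All sets except segmented sets may carry an alignment.
    return kind is not SetKind.SEGMENTED


@dataclass
class Segment:
    """One member record of a segmented set (hypothetical fields)."""
    accession: str   # placeholder accession.version, not a real record
    exon_start: int  # first base of the exon on this record
    exon_end: int    # last base of the exon on this record


def joined_location(segments: List[Segment]) -> str:
    # Build one location that spans the exons on several records,
    # the way an mRNA or CDS is annotated across a segmented set.
    parts = [f"{s.accession}:{s.exon_start}..{s.exon_end}" for s in segments]
    return "join(" + ",".join(parts) + ")"


if __name__ == "__main__":
    exons = [Segment("SEG0001.1", 10, 120), Segment("SEG0002.1", 5, 80)]
    print(joined_location(exons))
    for kind in SetKind:
        print(kind.value, indexed_in_popset(kind), may_contain_alignment(kind))
```

Each member record keeps its own exon feature, while the single joined location plays the role of the mRNA or CDS feature that crosses record boundaries.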
5. Bulk Submissions: High-Throughput Genomic Sequence (HTGS)
HTGS entries are submitted in bulk by genome centers, processed by an automated system,
and then released to GenBank. Currently, about 30 genome centers are submitting data for a
number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum, the
malaria parasite. HTGS [http://www.ncbi.nlm.nih.gov/HTGS/] data are submitted in four
phases of completion: 0, 1, 2, and 3. Phase 0 sequences are one-to-few reads of a single
clone and are not usually assembled into contigs. They are low-quality sequences that are
often used to check whether another center is already sequencing a particular clone. Phase 1
entries are assembled into contigs that are separated by sequence gaps; the relative order
and orientation of the contigs are not known. Phase 2 entries are also unfinished sequences that
may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct
order and orientation. Phase 3 sequences are of finished quality and have no gaps. For each
3 [http://www.ncbi.nlm.nih.gov/Sequin/index.html]