GenBank: The NCBI Nucleotide Sequence Database
Ilene MIZRACHI
National Center for Biotechnology Information, National Library of Medicine, Building
38A, Bethesda, MD 20894, USA
Abstract. The GenBank sequence database is an annotated collection of all publicly
available nucleotide sequences and their protein translations. This database is
produced at the National Center for Biotechnology Information (NCBI) as part of an
international collaboration with the European Molecular Biology Laboratory
(EMBL) Data Library from the European Bioinformatics Institute (EBI) and the
DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences
produced in laboratories throughout the world from more than 115,000 distinct
organisms. GenBank continues to grow at an exponential rate, doubling every 10
months. Release 142, produced in June 2004, contained over 40.3 billion nucleotide
bases in more than 35.5 million sequences. GenBank is built by direct submissions
from individual laboratories, as well as from bulk submissions from large-scale
sequencing centers. Direct submissions are made to GenBank using BankIt
[http://www.ncbi.nlm.nih.gov/BankIt/], which is a Web-based form, or the
stand-alone submission program, Sequin¹. Upon receipt of a sequence submission, the
GenBank staff assigns an Accession number to the sequence and performs quality
assurance checks. The submissions are then released to the public database, where
the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of
Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey
Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most
often made by large-scale sequencing centers. The GenBank direct submissions
group also processes complete microbial genome sequences.
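The growth figures quoted in the abstract can be turned into a rough projection. The sketch below is illustrative only: it assumes the 10-month doubling time stays constant after Release 142, which the source does not claim, and the function names projected_bases and months_to_reach are hypothetical helpers introduced here, not part of any NCBI tool.

```python
import math

# Figures taken from the abstract above.
RELEASE_142_BASES = 40.3e9   # nucleotide bases in Release 142 (June 2004)
DOUBLING_MONTHS = 10.0       # stated doubling time, in months


def projected_bases(months_elapsed,
                    start_bases=RELEASE_142_BASES,
                    doubling_months=DOUBLING_MONTHS):
    """Estimate database size after months_elapsed months.

    Assumes pure exponential growth with a fixed doubling time:
    N(t) = N0 * 2 ** (t / T). This is an assumption, not a forecast.
    """
    return start_bases * 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(target_bases,
                    start_bases=RELEASE_142_BASES,
                    doubling_months=DOUBLING_MONTHS):
    """Invert the growth formula: t = T * log2(N / N0)."""
    if target_bases <= 0 or start_bases <= 0:
        raise ValueError("base counts must be positive")
    return doubling_months * math.log2(target_bases / start_bases)


if __name__ == "__main__":
    for months in (0, 10, 20):
        size = projected_bases(months)
        print(f"{months:>2} months after Release 142: {size / 1e9:.1f} billion bases")
```

Under this assumption, projected_bases(20) is four times the Release 142 figure, or about 161 billion bases. Real growth rates vary from release to release, so the result is best read as an order-of-magnitude estimate.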
1. History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL).
In the early 1990s, this responsibility was awarded to NCBI through congressional
mandate. NCBI undertook the task of scanning the literature for sequences and manually
typing the sequences into the database. Staff then added annotation to these records, based
upon information in the published article. Scanning sequences from the literature and
placing them into GenBank is now a rare occurrence. Nearly all of the sequences are now
deposited directly by the labs that generate the sequences. This is attributable in part to a
requirement by most journal publishers that nucleotide sequences first be deposited into
publicly available databases (DDBJ/EMBL/GenBank) so that the Accession number can be
cited and the sequence can be retrieved when the article is published. NCBI began
accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
Currently, NCBI receives and processes about 25,000 direct submission sequences per
month, in addition to the approximately 700,000 bulk submissions that are processed
automatically.
¹ [http://www.ncbi.nlm.nih.gov/Sequin/index.html]