maintaining the high quality of the database requires careful sequence analysis and detailed
annotation of every entry. This was, and still is, a major rate-limiting step. We did not wish
to relax the editorial standards of Swiss-Prot and there was a limit to how much the
annotation procedures could be accelerated. Yet it was vital to make new sequences
available as quickly as possible. To address this concern, in 1996 we introduced TrEMBL
(Translation of EMBL). TrEMBL consists of computer-annotated entries derived from the
translation of all coding sequences in the EMBL database, except for those already included
in Swiss-Prot. TrEMBL is therefore a complement to Swiss-Prot: a sequence entry leaves
TrEMBL and enters Swiss-Prot only after it has been manually curated by an
annotator.
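The relationship between the two databases can be pictured as a set difference followed by a one-way promotion step. The following is only a minimal sketch of that idea, not the actual production pipeline: the `CodingSequence` record, the identifiers, and both function names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingSequence:
    protein_id: str   # hypothetical identifier of a coding sequence (CDS)
    translation: str  # protein sequence obtained by translating the CDS


def build_computer_annotated_set(coding_sequences, curated_ids):
    """Keep every translated CDS that is not already in the curated set."""
    return [cds for cds in coding_sequences if cds.protein_id not in curated_ids]


def promote(entry_id, computer_annotated, curated_ids):
    """Move one entry to the curated set once it has been manually annotated."""
    remaining = [cds for cds in computer_annotated if cds.protein_id != entry_id]
    return remaining, curated_ids | {entry_id}


# Toy data: three translated coding sequences, one of them already curated.
embl_cds = [
    CodingSequence("CDS0001", "MKT"),
    CodingSequence("CDS0002", "MAL"),
    CodingSequence("CDS0003", "MSE"),
]
swiss_prot_ids = {"CDS0002"}

trembl = build_computer_annotated_set(embl_cds, swiss_prot_ids)
print([cds.protein_id for cds in trembl])  # ['CDS0001', 'CDS0003']

# After manual curation, CDS0001 leaves the computer-annotated set.
trembl, swiss_prot_ids = promote("CDS0001", trembl, swiss_prot_ids)
print([cds.protein_id for cds in trembl])  # ['CDS0003']
```

The point of the sketch is that an entry is never in both sets at once: it starts in the computer-annotated set and moves to the curated set only through the explicit promotion step.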
From 1996 to the end of 2003, Swiss-Prot grew by 83,000 sequences to reach a total
of 140,000 entries. Over the same period, TrEMBL grew from the 86,000 entries in its first
release to about 1.1 million entries!
2. What makes Swiss-Prot special
2.1 Aiming for the perfect sequence
Even if it may be obvious to many of its users, it is important to restate that Swiss-Prot is a
corpus of knowledge centred on protein sequences. As will be apparent in the following
sections of this article, we add many layers of information around the sequence data, yet
most of that information depends in one way or another on the sequence. It is therefore
important to capture and represent the most accurate sequence possible. This is an aspect
of the work of Swiss-Prot that escapes the notice of most of its users.
The overwhelming majority (>99%) of the sequence data represented in Swiss-Prot
originates from the translation of nucleotide sequences submitted to the
EMBL/GenBank/DDBJ databases. Only a very small proportion of the sequences are
obtained directly at the amino-acid level using Edman degradation or mass spectrometry.
This situation already existed in 1986. What has changed since is, obviously, an enormous
quantitative increase in the amount of nucleotide sequence data, but also, more relevant to
our quest for quality, a significant increase in nucleotide sequence quality and a
sociological change in who originates the sequence data. The increase in
sequence quality is mainly due to the growing use of very sophisticated automated
sequencing machines. In 1986, most nucleotide sequences submitted to the
DNA databases originated from individual laboratories that were sequencing a single gene
or a small region of a genome. Today, the biggest (in terms of quantity) contributors are
major sequencing centres that either provide complete genomic sequences or massive
amounts of data from full-length cDNAs.
As we depend on primary sequence data that has been submitted to the nucleotide
sequence databases, it would seem at first glance that there is not really anything we can do
to improve the quality of the derived protein sequences. This is far from true: in
fact, there is much we can do by comparing sequences. Sequence comparison is
essential to the process of creating or updating a Swiss-Prot entry. One needs to remember
that Swiss-Prot is a non-redundant database: from the very beginning, we decided
to merge all protein sequences from the same organism
that originate from the same gene. Thus we are often faced with many complete or partial
sequences that need to be merged together and whose discrepancies have to be taken into
account. Sequence discrepancies are annotated with the feature (FT) keys CONFLICT,
VARIANT, MUTAGEN or VARSPLIC. The FT key VARIANT is used to describe
polymorphisms and disease mutations, MUTAGEN for experimentally altered sites and