CONFLICT for sequence differences of any other reason. Insertions or gaps within
alignments of otherwise identical sequences are usually due to alternative splicing events,
which are annotated using the FT key VARSPLIC.
Thus sequence comparisons can already help us determine which sequence is most
likely to be correct. This is especially true for organisms that are the focus of many sequencing
efforts. For example, we currently have an average of 3.7 independent sequence reports
(cDNA or genomic DNA) for each human protein. Such redundancy in the nucleotide
sequence database helps to flag potential sequencing errors. Further errors can be found
by comparing orthologous and paralogous sequences across species. The relevance of
such approaches is increasing as more and more full genome sequences become
available.
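As an illustration of how redundant reports can flag potential errors, the following minimal sketch compares several independent reports of the same protein column by column and lists the positions where a report disagrees with the majority. It assumes the reports are already aligned to equal length; the function name `flag_discrepancies` and the example sequences are hypothetical and do not describe the tools actually used by Swiss-Prot annotators.

```python
from collections import Counter


def flag_discrepancies(reports):
    """Return (position, consensus_residue, {report_index: residue}) for every
    alignment column in which at least one report differs from the majority.

    Assumption: `reports` are pre-aligned strings of equal length.
    Ties are resolved in favour of the residue encountered first.
    """
    if len({len(r) for r in reports}) != 1:
        raise ValueError("reports must be pre-aligned to equal length")
    flagged = []
    for pos, column in enumerate(zip(*reports)):
        consensus, count = Counter(column).most_common(1)[0]
        if count < len(column):
            dissent = {i: res for i, res in enumerate(column) if res != consensus}
            flagged.append((pos, consensus, dissent))
    return flagged


# Hypothetical example: four reports of one protein, the third differs at one site.
reports = ["MKTAYIAK", "MKTAYIAK", "MKTVYIAK", "MKTAYIAK"]
print(flag_discrepancies(reports))
```

A flagged position is only a candidate for review: the dissenting report may reflect a genuine variant rather than a sequencing error.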
One of the advantages of comparing many sequences is the detection of probable
frameshift errors. They stand out in multiple protein sequence alignments as locally
divergent regions. If the divergence can be explained at the nucleotide level by the insertion
or deletion of a single nucleotide, it is likely (but not certain) that it is due to a sequencing
error. The total number of potential frameshift errors corrected by Swiss-Prot
annotators is difficult to estimate, because the original authors often resubmit incorrect
DNA sequences later with the sequencing errors corrected, generally taking into
account the corrections made in the corresponding Swiss-Prot entries. In the current release,
1% of the entries are flagged with at least one potential frameshift error in one
of the cross-referenced nucleotide sequence entries.
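The reasoning above can be sketched in a few lines: translate two nucleotide reports, observe that the proteins diverge locally, and then test whether a single-nucleotide insertion or deletion accounts for the difference. This is a simplified sketch under stated assumptions: it uses the standard genetic code, compares only two sequences that share the same start, and the names `translate` and `single_indel_position` as well as the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 0 of `dna`; trailing incomplete codons are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def single_indel_position(ref, query):
    """Return the index (in the longer sequence) of the first mismatch if the
    two sequences differ by exactly one inserted or deleted nucleotide,
    otherwise None. Within a run of identical bases the reported position
    is the end of the run, since the exact site is ambiguous."""
    if abs(len(ref) - len(query)) != 1:
        return None
    longer, shorter = (ref, query) if len(ref) > len(query) else (query, ref)
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    return i if longer[i + 1:] == shorter[i:] else None


# Hypothetical example: the second report lacks one "A" of the AAA codon.
ref = "ATGGCTAAAGGTCTGGAACGTTAA"
query = "ATGGCTAAGGTCTGGAACGTTAA"
print(translate(ref))    # MAKGLER*
print(translate(query))  # MAKVWNV
print(single_indel_position(ref, query))  # 8
```

The two translations agree up to the affected codon and diverge afterwards, which is the locally divergent pattern described above. A positive result marks a probable frameshift, not a proven one.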
In many cases, the N-terminal initiation sites of bacterial or archaeal genes or the
exon/intron boundaries of eukaryotic genes are incorrectly predicted. It is important to note
that these predictions are of very heterogeneous quality and to recognise that not all
sequencing centres produce the same level of quality in terms of both sequences and
protein-coding gene predictions. Swiss-Prot annotators are aware of this heterogeneity and
know which data can be more or less trusted. We currently disagree with the translation
provided by the submitter in 7.1% of our entries.
Annotators often have to translate, from a nucleotide entry, protein
sequences that were overlooked by the original submitters. Currently 2.5% of
our entries contain such translations.
Finally, Swiss-Prot annotators also have to reject putative protein
sequences that are obviously bogus, either because they originate from a pseudogene or
because they were incorrectly predicted from non-coding DNA or from the wrong open
reading frame.
If you take all the above factors and tasks into consideration, you can see why we
believe that the correction of amino-acid sequences is an important part of the annotation
process, and that it is far from trivial. This is not necessarily apparent to
the user, but it is one of the reasons why Swiss-Prot has always been considered the
reference database for protein sequences. Of course, the drawback of such an approach is
that it is time-consuming and can only be applied to manually annotated entries. It
can consequently not be applied to TrEMBL, where the represented protein
sequences are those indicated by the submitters of the original nucleotide
sequence entry. It would therefore be important to develop semi-automatic systems that
allow some aspects of sequence correction to be applied to TrEMBL.
2.2 Extracting information from the literature
Fifteen years ago, Swiss-Prot annotators typically went through the following process: they
photocopied all relevant papers from the reference list of the entry they were annotating.