are to carry out technological watch and to develop new methodologies for protein
sequence analysis.
Currently we use software tools (a full list with references is available in the Swiss-
Prot document annbioch.txt) to predict the following sequence features:
- Signal sequences of type 1, type 2 (lipoprotein) and type 3;
- Mitochondrial and plastid targeting sequences;
- Transmembrane domains;
- Coiled coil domains;
- Specific repeats (LRR, TPR, WD, etc.);
- Statistically significant runs of amino acids and regions enriched in particular amino acids;
- N-glycosylation sites;
- GPI-anchors;
- Sulfation sites;
- N-terminal myristoylation sites.
In addition to the above list, we make extensive use of domain/family databases to
annotate specific domains. In fact the development of the PROSITE [8] database, which
was first released in 1990, was specifically driven by the need to detect and annotate
protein domains. The combined usage of profiles and patterns allows the detection of
domains (profile) and the functional sites within domains (pattern). As mentioned in the
section on cross-references (3.7), there are now many other protein domain databases and
we occasionally make use of most of them to annotate specific domains not yet covered by
PROSITE. The reasons for our preference for PROSITE over other similar databases are
very pragmatic: PROSITE domain descriptors are specifically tailored for use in the
context of protein sequence annotation, so that they do not predict overlapping domains. Cut-off
values are selected conservatively to minimise the number of false positives: we prefer to
miss the occurrence of a domain rather than to over-predict its existence.
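
To make the notion of a pattern concrete, here is a minimal sketch of how a PROSITE-style pattern can be translated into a regular expression and scanned against a sequence. It is an illustration only, not the code used by the PROSITE or Swiss-Prot groups. The function names `prosite_to_regex` and `find_sites` and the example sequence are hypothetical, and only a subset of the pattern syntax is handled (`x`, `[..]`, `{..}`, repetition counts, and the `<` / `>` anchors outside brackets). The pattern used in the example is the classical N-glycosylation consensus, N-{P}-[ST]-{P}.

```python
import re
from typing import List, Tuple

# Hypothetical helper: handles only a subset of the PROSITE pattern syntax.
_ELEMENT = re.compile(r"(<?)([^()<>]+)(?:\((\d+(?:,\d+)?)\))?(>?)")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = m.groups()
        if core == "x":
            token = "."                        # any residue
        elif core[0] == "{" and core[-1] == "}":
            token = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            token = core                       # single residue or [ABC] class
        if repeat:
            token += "{" + repeat + "}"        # (n) or (n,m) repetition
        parts.append(("^" if start else "") + token + ("$" if end else ""))
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched residues) for every match, overlaps included."""
    # The lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Invented toy sequence, used only to exercise the functions.
    for start, site in find_sites("N-{P}-[ST]-{P}", "MNGTANPSNVSA"):
        print(start, site)
```

A short pattern such as this one matches by chance in many sequences, which is one reason why site predictions of this kind need to be interpreted in context rather than taken at face value.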
We believe that the use of the most up-to-date sequence analysis tools is essential to
any protein sequence annotation effort. In addition anyone considering applying such
methods on a large scale needs to develop internal benchmarks so as to objectively judge
the validity and the scope of the methods. In many instances we have observed that the
claims of developers of sequence analysis methods are slightly overblown and that one
obtains unexpected results when using such methods on large and highly heterogeneous
sets of sequences.
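
An internal benchmark of the kind described above can be as simple as comparing a method's predictions with a curated reference set and counting the errors at a given cut-off. The sketch below is one possible way to do this, not a description of the procedure actually used for Swiss-Prot. The function names, sequence identifiers and scores are all invented for illustration.

```python
from typing import Dict, Set, Tuple


def evaluate(scores: Dict[str, float], positives: Set[str],
             cutoff: float) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives at a cut-off.

    scores    -- predictor score per sequence identifier
    positives -- identifiers known (from curation) to carry the feature
    """
    predicted = {seq_id for seq_id, score in scores.items() if score >= cutoff}
    tp = len(predicted & positives)   # correctly predicted
    fp = len(predicted - positives)   # over-predicted
    fn = len(positives - predicted)   # missed
    return tp, fp, fn


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall, defined as 1.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Invented toy benchmark: five sequences, three of which truly have the feature.
toy_scores = {"seqA": 9.5, "seqB": 8.0, "seqC": 6.5, "seqD": 6.0, "seqE": 3.0}
toy_positives = {"seqA", "seqB", "seqD"}

if __name__ == "__main__":
    for cutoff in (6.0, 7.0):
        tp, fp, fn = evaluate(toy_scores, toy_positives, cutoff)
        print(cutoff, (tp, fp, fn), precision_recall(tp, fp, fn))
```

In this toy set, raising the cut-off from 6.0 to 7.0 removes the single false positive at the cost of one missed sequence, which is the trade-off accepted when cut-off values are chosen conservatively.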
2.4.3 Automation: trying to simulate the expertise of annotators
Thanks to genome sequencing efforts, there has been a tremendous rise in the number of
available protein sequences. Yet clearly this is only the beginning and what exists now will
only represent a drop in an ocean of uncharacterised sequences. And there lies both the
problem and a possible solution: on the one hand, the overwhelming majority of
genome-derived sequences are currently not the target of experimental characterisation and
are probably not going to be so in the next decade. On the other hand, we have encapsulated in
Swiss-Prot a tremendous amount of knowledge, some of which is specific to a given
protein, while the majority can be carefully propagated to well defined orthologous
sequences. Automatic annotation is far from being a novel concept. But what we want to
achieve in Swiss-Prot differs from what others expect from such systems. Their aim is to
analyse new genomic sequences and predict a maximum of potential information items so
as to be able to infer hypotheses on the potential biological processes present in the