organism. Our aim is to produce high-quality annotation with a minimal number of
incorrect inferences.
Our first automatic annotation project is called HAMAP [9], which stands for High-
quality Automated and Manual Annotation of microbial Proteomes. In the context of this
project, proteins from complete bacterial and archaeal proteomes, together with the related
plastid proteins, are automatically annotated based on manually created family rules for
complete protein annotation, with template-based feature propagation. Proteins with no
similarity to other proteins in Swiss-Prot, which we call ORFans, undergo an automated
protein sequence analysis procedure that looks for many of the sequence features described
in the preceding section. These features are then automatically annotated according to rules
of consistency and dependency.
We have just developed a second system called Anabelle that strives to annotate not
only ORFans and well-defined proteins, but also any protein with one or more conserved or
functional domains or sites detected by one of the methods carefully selected for their
accuracy by the Swiss-Prot team. The information retrieved from all results is logically
combined according to selection rules and logical rules, thus reaching more trustworthy
conclusions than is possible when looking at one result at a time. Anabelle is integrated in
the annotator's workbench: the automatically pre-selected analysis results are visualized in
a graphical system, from which the annotator can choose the true positive results and easily
generate annotation based on sequence similarity and sequence analysis. Not only does this
speed up annotation, but it also promotes the consistent transfer of entire information
blocks that logically group together, ensuring the usage of standardised vocabulary and
minimising the probability of errors and typos.
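As a rough illustration of how results from several analysis methods might be combined, the following Python sketch first applies a selection rule (a minimum score per method) and then a logical rule across the surviving results. The method names, thresholds, feature names and the rule itself are hypothetical and are not taken from Anabelle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    method: str    # hypothetical analysis method name
    feature: str   # predicted feature, e.g. "signal_peptide"
    score: float   # method-specific confidence, assumed to lie in [0, 1]


# Selection rule (assumed): minimum score a hit needs, per method.
THRESHOLDS = {"signal_predictor": 0.8, "tm_predictor": 0.7, "domain_scan": 0.9}


def select(hits):
    """Keep only hits that reach their method's threshold.

    Hits from methods without a configured threshold need a score of 1.0.
    """
    return [h for h in hits if h.score >= THRESHOLDS.get(h.method, 1.0)]


def combine(hits):
    """Apply an illustrative logical rule to the selected hits.

    Rule (invented for this sketch): propose "secreted" only when a signal
    peptide is predicted and no transmembrane region is predicted.
    """
    features = {h.feature for h in select(hits)}
    proposed = set(features)
    if "signal_peptide" in features and "transmembrane" not in features:
        proposed.add("secreted")
    return sorted(proposed)


hits = [
    Hit("signal_predictor", "signal_peptide", 0.93),
    Hit("tm_predictor", "transmembrane", 0.41),  # below threshold, dropped
    Hit("domain_scan", "kinase_domain", 0.97),
]
print(combine(hits))
```

The point of the sketch is that the conclusion depends on several results taken together: the same signal peptide hit leads to a different proposal once a transmembrane prediction passes its threshold.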
We believe that careful application of rules to produce automatically or semi-
automatically annotated protein entries brings about many advantages for users of Swiss-
Prot. We know that many are apprehensive of the word “automation” and are afraid that we
will drown high-quality manually annotated entries with lower-quality “automated” entries.
We are very aware of this danger and are almost paranoid in our efforts to ensure that
automatic annotation produces data of a quality equal to that of manual curation. Finally, it
must be noted that one of the important changes planned in the Swiss-Prot format (see 2.6)
is very pertinent to this issue: the introduction of “evidence tags”, which should make it
possible to flag unambiguously whether an information item has been manually or
automatically derived.
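The idea of tagging each information item with its origin can be sketched as a small data structure. This is only an assumed model for illustration: the class names, field names and tag values below are hypothetical and do not reproduce the actual Swiss-Prot evidence tag syntax.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Hypothetical origin of an information item."""
    MANUAL = "manual"
    AUTOMATIC = "automatic"


@dataclass(frozen=True)
class AnnotationItem:
    text: str           # the annotation itself
    evidence: Evidence  # how the item was derived


def manually_derived(items):
    """Return only the items flagged as manually derived."""
    return [item for item in items if item.evidence is Evidence.MANUAL]


# Invented example items, not real database content.
entry = [
    AnnotationItem("Example function statement", Evidence.MANUAL),
    AnnotationItem("Example predicted feature", Evidence.AUTOMATIC),
]
print([item.text for item in manually_derived(entry)])
```

With a tag on every item, a user who only trusts manual curation can filter an entry without losing access to the automatically derived items.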
2.5 Standardisation and controlled vocabularies
2.5.1 A long tradition of using controlled vocabularies in Swiss-Prot
To allow effective and precise database retrieval and searches, the same concepts need to be
described with the same terms everywhere in the database. Controlled vocabularies or
indexing terms can serve this purpose. A controlled vocabulary is defined as “an organised
list of words and phrases, or notation system, that is used to initially tag content, and then to
find it through navigation or search” (Amy Warner 1).
Since its creation, Swiss-Prot has stored information under specific line types, many
of which are structured in such a way as to facilitate text searches in the database. Even the
fields that appear to contain unstructured text are often written according to strict guidelines
to ensure consistency. In some cases, lists are made where “preferred” terms are associated
with synonyms, spelling differences, abbreviations, or other terms considered
equivalent.
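A list that associates “preferred” terms with their equivalents can be sketched as a lookup table that maps every variant back to one preferred term. The terms below are invented examples, not entries from an actual Swiss-Prot list, and case-insensitive matching is an assumption of this sketch.

```python
# Hypothetical preferred terms with synonyms, spelling variants and abbreviations.
SYNONYMS = {
    "Hemoglobin": ["haemoglobin", "Hb"],
    "Signal peptide": ["signal sequence", "leader peptide"],
}


def build_index(synonyms):
    """Map each preferred term and each of its variants to the preferred term."""
    index = {}
    for preferred, variants in synonyms.items():
        for term in [preferred, *variants]:
            index[term.casefold()] = preferred
    return index


INDEX = build_index(SYNONYMS)


def normalise(term):
    """Return the preferred term for `term`, or None if it is not listed."""
    return INDEX.get(term.strip().casefold())


print(normalise("haemoglobin"))
print(normalise("Leader peptide"))
```

Searching on the normalised term means that a query using any listed variant retrieves the same set of entries.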
1 http://www.lexonomy.com/publications/aTaxonomyPrimer.html