Table 1 provides a partial description of where and how Swiss-Prot either makes
use of existing controlled vocabularies or has developed such corpora.
Protein names (DE line)
As the primary name we use the one that seems most appropriate according to the function of the
protein, the nomenclature adopted by the specialists in that field, the gene name, etc. We keep all
synonyms used in publications and authors' submissions unless they are misleading. Furthermore, we
transfer the same name to the orthologs in related organisms.
Gene names (GN line)
Whenever a nomenclature committee (for example HUGO, FlyBase, etc.) provides “official” gene names
for a given organism, we try to enforce its choice of gene names, while keeping the names that authors
originally provided as synonyms.
Species names (OS line)
The species names used in Swiss-Prot are listed in a document (speclist.txt). From the very beginning,
care has been taken to store not only the official (scientific) name, but also the most useful common
names and synonyms.
Species taxonomy (OC and OX lines)
We make use of the taxonomy compiled by NCBI which is used by most major biomolecular sequence
databases.
Organelle (OG line)
We standardize the usage of plasmid names and list them in a Swiss-Prot document (plasmid.txt).
Reference comments (RC line)
Among other uses, the RC line indicates the tissue from which a protein originates (TISSUE) or
the strain (STRAIN). The tissues are listed in the file tisslist.txt and the strains in strains.txt. Both lists
include information on synonyms.
Reference authors (RA line)
As far as possible, the names of authors are stored according to consistent rules. For example, the German
umlaut is replaced by an 'e' following the vowel that carried the umlaut; the hyphen between two
initials is retained (it is removed in Medline/PubMed); we keep all the initials (even where PubMed
keeps only two); and we often correct misspellings in author names.
Reference location (RL line)
Journal abbreviations in Swiss-Prot follow whenever possible those used by the National Library of
Medicine (NLM). We provide a journal list (jourlist.txt) that, in addition to the journal names and
abbreviations, also provides ISSN (International Standard Serial Number), CODEN number, publishers
and journal home page web addresses.
Comments (CC line)
The CC lines mainly contain free text comments classified under 24 different topics. If a piece of
information cannot be classified under a specific topic, it is put under 'MISCELLANEOUS'.
However, with time, the information in the CC lines is becoming less 'free', so to speak, and more and
more CC line topics are subject to controlled vocabularies. For example, this is the case for the
'CATALYTIC ACTIVITY' topic, whose text is taken from the ENZYME database [10] for all known
enzymes, which are referred to by their EC (Enzyme Commission) numbers in the DE lines. We are currently
standardizing the use of the 'COFACTOR', 'PATHWAY' and 'SUBCELLULAR LOCATION' topics.
Keywords (KW line)
Keywords were one of the first sets of controlled vocabulary in Swiss-Prot. They were introduced to
summarize the content of an entry and to group entries according to different aspects related to biological
processes, molecular function, subcellular location, domains, ligands, sequence modifications and
diseases. We provide a keyword list (keywlist.txt) that is being superseded by a dictionary that provides
the precise definition of the usage of a keyword in the context of Swiss-Prot. The dictionary also includes
synonyms, groups keywords into categories and provides a mapping between Swiss-Prot keywords and
GO terms (see 3.5.2).
Feature table (FT line)
We are currently establishing a controlled vocabulary for the features describing posttranslational
modifications (PTMs) [11]. We are also building a PTM database to store, for each type of modification,
information such as the general description, target(s), chemical formula, subcellular localization of
modified site, enzyme(s) carrying out the PTM, etc. Domain-type (DOMAIN, REPEAT, DNA_BIND,
ZN_FING, etc.) feature descriptions are also standardized across all of Swiss-Prot.
Sequence
The sequences are stored in the one-letter code adopted by the Commission on Biochemical
Nomenclature of the IUPAC-IUBMB.
Table 1: Standardization efforts and use of existing or in-house controlled vocabularies in
Swiss-Prot, listed by line type.
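
To make two of the rows in Table 1 concrete, here is a minimal sketch in Python. It groups the lines of an entry by their two-letter line code (DE, GN, RA, ...) and applies the umlaut rule described for the RA line. It is not Swiss-Prot's own tooling: the function names, the assumed column layout (code in the first two characters, data from column 6) and the sample entry are all assumptions made for illustration.

```python
from collections import defaultdict

# Umlaut -> base vowel followed by 'e', as described for the RA line.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def normalize_author(name):
    """Replace each German umlaut with its base vowel plus 'e'.

    Hyphens between initials are left untouched, matching the rule
    that they are retained.
    """
    return name.translate(UMLAUT_MAP)


def group_by_line_code(entry_text):
    """Group the lines of a flat-file entry by their two-letter code.

    Assumption: the code occupies the first two characters and the
    data starts at column 6. Lines without a code and the '//'
    terminator are skipped.
    """
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        if not code.strip() or code == "//":
            continue
        grouped[code].append(line[5:].strip())
    return dict(grouped)


# Hypothetical fragment invented for this example (not a real entry).
SAMPLE = """\
DE   Example protein.
GN   EXA1.
RA   Müller A.-B., Schröder C.;
KW   Hypothetical keyword.
//
"""

fields = group_by_line_code(SAMPLE)
raw_authors = fields["RA"][0].rstrip(";").split(",")
authors = [normalize_author(a.strip()) for a in raw_authors]
print(authors)  # ['Mueller A.-B.', 'Schroeder C.']
```

A translation table keeps the rule in one place and leaves all other characters, including the hyphen between initials, unchanged. The other rules in the RA row, such as correcting misspelled author names, need curated data and are not covered by this sketch.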
This list, even if incomplete, is impressive; yet it does not capture the whole
complexity of issues surrounding the use of nomenclature and controlled vocabularies in
the life sciences. We need to state here that if physicists or chemists behaved as biologists
do, we would probably live in a world without computers or plastic (this may sound like an
attractive proposition to some!). Life scientists are not taught, during their training, the
importance of following nomenclature rules. Yet, they are the first to