2.6 Evolution of entry structure and format
Since its creation in 1986, the basic structure of a Swiss-Prot entry has not changed
significantly. The distinct line types defined by a 2-letter code are generally relevant to all
entries and cover the core data, while the actual protein information is given in the
comment (CC) lines and in the feature table (FT). While the general framework has been
very stable, we have carried out many changes over the years. New line types were
introduced, the structure of existing line types was constantly refined and new sub-fields
(comment topics, feature keys) were added. Such changes are always documented (in
release notes and other documents) and users are warned in advance of pending changes so
that they can adapt their software tools. While the general stability of the Swiss-Prot flat
file format may be seen as proof of foresight, careful planning and experience, one can
also say that in some respects Swiss-Prot has become a victim of its own success: even the
smallest modification to the flat file format, or the introduction of new fields, needs to be
considered carefully, and ideas are sometimes discarded for the sole reason that “this
will cause the crash of thousands of programs out there…”.
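The concern about crashing programs can be made concrete with a minimal sketch of a reader that groups lines by their 2-letter code and tolerates codes it does not know. The sample entry below is made up and heavily abbreviated, and the function names are hypothetical; the only format facts relied on are that each line starts with a 2-letter code, that the data starts in column 6, and that an entry ends with a `//` line.

```python
from collections import defaultdict

# Made-up, heavily abbreviated entry used only for illustration.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   120 AA.
AC   P00000;
DE   Hypothetical example protein.
CC   -!- FUNCTION: Placeholder comment text.
CC   -!- SUBUNIT: Placeholder comment text.
FT   DOMAIN       10     50       Placeholder domain.
//
"""


def group_by_line_code(entry_text):
    """Group the data portion of each line under its 2-letter line code."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # entry terminator
            break
        if not line.strip():  # skip blank lines
            continue
        code, data = line[:2], line[5:]  # data starts in column 6
        groups[code].append(data)
    return dict(groups)


def unknown_codes(groups, known_codes):
    """Return line codes present in the entry that the caller does not handle."""
    return sorted(set(groups) - set(known_codes))


groups = group_by_line_code(SAMPLE_ENTRY)
# A tool written before FT existed could report the new code instead of failing.
print(unknown_codes(groups, {"ID", "AC", "DE", "CC"}))
```

A reader written this way keeps working when a new line type appears, because unrecognised codes are collected rather than treated as errors. Tools that hard-code the set of expected line types are the ones at risk from format changes.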
Swiss-Prot and TrEMBL have traditionally been maintained and distributed as flat
files. An inherent problem of flat file databanks is that their maintenance becomes
increasingly difficult when they grow in size and many people are involved in the
production of the data. Since 2002, Swiss-Prot and TrEMBL have also been distributed in XML
(http://www.ebi.uniprot.org/support/documents.shtml), the extensible markup language that
makes it possible to define the content of a document separately from its formatting,
making it easy to reuse that content in other applications or for other presentation
environments. In contrast to HTML, XML allows the authors of a document to create their
own markup tags to suit their needs and to structure the data as well as possible. What is
more, XML allows the implementation of rules that are not limited to formatting but can also
be used to formulate dependencies. We are also in the process of porting the production of Swiss-Prot
and TrEMBL to a Relational Database Management System. In order to develop the
relational and XML schema, we have designed conceptual data models, using the Unified
Modelling Language (UML) notation, to represent the structure and constraints present in
the data.
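The separation of content from formatting can be illustrated with a minimal sketch in which the same XML content is rendered in two ways and checked against one dependency-style rule. The element and attribute names are hypothetical and do not reproduce the distributed Swiss-Prot XML schema, and the rule is expressed in Python here rather than in an XML schema language.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names; not the actual distributed schema.
SAMPLE_XML = """\
<entry>
  <accession>P00000</accession>
  <gene>
    <name type="primary">abcD</name>
    <name type="synonym">xyz1</name>
  </gene>
</entry>
"""


def gene_names(xml_text):
    """Extract (type, value) pairs for every gene name in the entry."""
    root = ET.fromstring(xml_text)
    return [(n.get("type"), n.text) for n in root.findall("./gene/name")]


def as_plain_text(names):
    """One presentation of the content: a single text line."""
    return "; ".join(f"{kind}={value}" for kind, value in names)


def as_html_list(names):
    """Another presentation of the same content: an HTML list."""
    items = "".join(f"<li>{value} ({kind})</li>" for kind, value in names)
    return f"<ul>{items}</ul>"


def has_exactly_one_primary(names):
    """Assumed dependency rule: a gene block carries exactly one primary name."""
    return sum(1 for kind, _ in names if kind == "primary") == 1


names = gene_names(SAMPLE_XML)
print(as_plain_text(names))
print(as_html_list(names))
print(has_exactly_one_primary(names))
```

The content is extracted once and the two rendering functions never touch the XML, which is the reuse that the markup is meant to enable.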
In the meantime, until the production copy of Swiss-Prot is managed in a relational
database management system, we still need to introduce certain format changes to the flat
file in order to accommodate more complex concepts. Such changes can be quite
substantial and time-consuming, because they are always introduced in such a way that not
only is new annotation performed according to the new format, but all existing entries are
converted as well. Consequently, in addition to the creation of conversion software and
the modification of documentation and annotation tools, this can involve a lot of manual
cleaning. That we need to embark on such manual cleaning steps is not due to the structure
or the format of the database, but rather to our pathological urge to make sure that all
aspects of Swiss-Prot are self-consistent. Therefore, whenever we introduce a new type of
data, we try as much as possible to update all the entries where such data has some
relevance.
There are many changes we plan to make to the flat file format. For example, in the
near future, we plan to overhaul the format of the GN (gene) line so that it will allow a
more structured representation of the information concerning gene names. The new format
will distinguish between the official gene name, synonyms, ordered locus names and ORF
names. This change will allow a better representation of the complexity of gene and locus
naming schemes.
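As a rough sketch of what a structured GN line makes possible, the parser below splits a line into the four categories named above. The token names (`Name`, `Synonyms`, `OrderedLocusNames`, `ORFNames`), the `key=value;` syntax and the sample gene names are assumptions chosen for illustration, since the text does not spell out the syntax; `parse_gn_line` is a hypothetical helper.

```python
def parse_gn_line(line):
    """Parse one structured GN line into a dict mapping token -> list of names.

    Assumes 'GN' plus three spaces, then 'Key=value1, value2;' fields.
    """
    if not line.startswith("GN   "):
        raise ValueError("not a GN line")
    result = {}
    for field in line[5:].split(";"):
        field = field.strip()
        if not field:  # skip the empty piece after the final ';'
            continue
        key, _, value = field.partition("=")
        result[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return result


# Made-up gene names, for illustration only.
sample = "GN   Name=abcD; Synonyms=xyz1, xyz2; OrderedLocusNames=b0001; ORFNames=ORF_1;"
parsed = parse_gn_line(sample)
print(parsed["Name"])      # official gene name
print(parsed["Synonyms"])  # alternative names
```

With explicit tokens, a program no longer has to guess from position or punctuation which name is the official one and which are synonyms or locus identifiers.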
As we described in the section on automatic annotation (see 2.4.3), it is important to
provide users with a means to track down the origin of all information items in a Swiss-Prot