5.2.2 Desired Features of Allergen Databases
One of the main desired features of an allergen database is the aggregation of all publicly
available allergen-specific information into a comprehensive resource. This aggregation
activity should take note of the following points:
1. The database should aim to be as comprehensive as possible. In practice, the
creation of a one-stop resource for all allergen information is a nontrivial task.
There are already allergen databases that cater to specific needs. Besides,
creating and maintaining a comprehensive database would require effort and
resources that only a few groups could afford.
2. The records contained in the database should be nonredundant, and steps should
be taken to ensure this. Redundancy leads to over- and underrepresentation
of data, which can cause errors in allergen analyses. This is particularly
important if the records are used as training sets for allergenicity prediction.
Moreover, redundancy leads to false estimates of the number of true known
allergens. Sequence similarity methods like BLAST (Altschul, Gish, Miller, Myers,
and Lipman 1990) can be used effectively to reduce sequence redundancy by
searching for similar sequence records.
3. Each source database contains different types of biological data, which
necessitates the design of a common data format that can encompass all available
information.
4. The fields contained in the records should be useful for allergen researchers.
Therefore, the design of the record format should take into account the expected
usage. Some of the common fields required include nucleotide sequence, protein
sequence, literature references, and 3D protein structure.
5. As far as possible, the allergen names should comply with the nomenclature
(King, Hoffman, Lowenstein, Marsh, Platts-Mills, and Thomas 1994) set out by
the Allergen Nomenclature subcommittee of the IUIS (International Union of
Immunological Societies). Allergens contained in the IUIS allergen list should be
referred to by their official names to prevent naming conflicts.
6. The use of multiple source databases may lead to conflicting data. Manual
curation would then be required to resolve these conflicts.
7. There is a need to update the allergen database whenever there are changes or
updates in the source databases. The propagation of information from the source
databases to the specialized allergen databases ensures that the database is
current.
8. Some allergen information is only present in the literature and the lack of a
structured form of literature data necessitates the manual extraction of this
information. This requires large amounts of time and effort.
9. The source databases may contain errors, so their records have to be validated.
In most cases, the validation has to be done manually. Like information extraction
from the literature, this requires both time and effort.
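Several of the points above (a common record format, nonredundancy, and the
detection of conflicts between sources) can be illustrated with a minimal Python
sketch. Everything in it is an assumption made for illustration: the
AllergenRecord class, its field names, the 0.95 threshold, and the placeholder
names and sequences are hypothetical. In particular, the standard-library
difflib.SequenceMatcher is only a simple stand-in for a sequence similarity
search such as BLAST and is not suitable for real protein comparison.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, List, Set


@dataclass
class AllergenRecord:
    """Hypothetical common record format covering the fields named in point 4."""
    name: str                      # ideally the official IUIS name (point 5)
    source_db: str                 # which source database the record came from
    protein_sequence: str
    nucleotide_sequence: str = ""
    references: List[str] = field(default_factory=list)
    structure_ids: List[str] = field(default_factory=list)


def similarity(seq_a: str, seq_b: str) -> float:
    """Crude similarity ratio in [0, 1]; a stand-in for a BLAST-style search."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def deduplicate(records: List[AllergenRecord],
                threshold: float = 0.95) -> List[AllergenRecord]:
    """Keep a record only if it is not too similar to any record already kept."""
    kept: List[AllergenRecord] = []
    for rec in records:
        if all(similarity(rec.protein_sequence, k.protein_sequence) < threshold
               for k in kept):
            kept.append(rec)
    return kept


def find_name_conflicts(records: List[AllergenRecord]) -> Dict[str, Set[str]]:
    """Report identical sequences that carry different names across sources.

    These are the cases that would be handed to a curator (point 6).
    """
    names_by_sequence: Dict[str, Set[str]] = {}
    for rec in records:
        names_by_sequence.setdefault(rec.protein_sequence, set()).add(rec.name)
    return {seq: names for seq, names in names_by_sequence.items()
            if len(names) > 1}


# Made-up placeholder records; the names and sequences are not real allergens.
records = [
    AllergenRecord("Xxx x 1", "db_a", "MKTLLVAAALAVSASA"),
    AllergenRecord("Xxx x 1", "db_b", "MKTLLVAAALAVSASA"),
    AllergenRecord("Yyy y 2", "db_a", "GSHMQQPWERTYIPLK"),
    AllergenRecord("Zzz z 3", "db_b", "MKTLLVAAALAVSASA"),
]

nonredundant = deduplicate(records)
conflicts = find_name_conflicts(records)
print([rec.name for rec in nonredundant])
print(conflicts)
```

The greedy filter keeps the first record seen for each group of similar
sequences, so the order of the input decides which source is retained. A real
pipeline would probably merge the fields of redundant records rather than
discard them, and would leave the reported conflicts to manual curation.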
In view of these factors, the aggregation process should be performed in two
steps. The first step would be to aggregate the information present in the source
databases into a format that encompasses all the required and useful fields. As far as