5.2.4 Pitfalls of Current Allergen Databases
Current allergen databases provide most of the desired features, except for a
download feature. Although it is possible to parse HTML pages or Excel files to
extract the information, doing so is error-prone. HTML pages are structured, but the
markup mostly describes the appearance of the content rather than its type. As a
result, bioinformaticians have to re-create the allergen datasets from scratch rather
than reuse existing ones. This is not only time-consuming and repetitive; it also
prevents the creation of a standard set of data for the development of new
bioinformatics methods and analyses.
The lack of a standard set of data means that developed methods and analysis results
cannot be easily compared with one another, which hinders the overall progress of
development. In contrast, downloadable dataset formats like those provided by GenBank
efficiently support bioinformaticians in developing new analysis methods. The adoption
of such features by the allergen databases would aid the development of bioinformatics
tools for allergen research. The allergen databases could then serve as a platform
for the development of new methods and large-scale analyses.
5.3 Allergenicity Prediction
The holy grail of applying bioinformatics to allergen research is the prediction
of allergenicity. Accurate prediction of allergenicity is likely to improve the
allergenicity assessment of recombinant proteins and thereby lower the cost of
testing them. Considering the spread of recombinant protein use in food,
medications, and everyday items, the impact of predictive methods is expected to be
substantial.
Predictive methods are often compared on the basis of their precision and recall.
Precision is the ability of the method to correctly predict true allergens among the
predicted allergens. It is usually expressed as the percentage of predicted allergens
that are true allergens. Recall, on the other hand, is the ability of the method to
detect the allergens in the test set. It is expressed as the percentage of allergens
in the test set that are correctly predicted. The equations for precision and recall
are provided below. A high precision means that any predicted allergen is likely to be
a true allergen, while a high recall means that the method correctly predicts a large
portion of the allergens in the test set. In practice, a trade-off is usually required,
because improving one of the two measures tends to lower the other.
\[
\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn} \qquad (2)
\]
where tp is the number of true positives (correctly predicted allergens), fp is the
number of false positives (non-allergens wrongly predicted to be allergens) and fn is
the number of false negatives (allergens wrongly predicted to be non-allergens).
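
The two measures can be computed directly from these counts. The following is a minimal
sketch in Python, not a method taken from any particular allergenicity predictor: the
function names precision_recall and predict, the scores, the labels and the two
thresholds are all hypothetical and chosen only to illustrate the definitions and the
trade-off between the two measures.

```python
from typing import List, Sequence, Tuple


def precision_recall(predicted: Sequence[bool],
                     actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for parallel sequences of allergen labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Count true positives, false positives and false negatives.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    # Report 0.0 when a denominator is zero (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def predict(scores: Sequence[float], threshold: float) -> List[bool]:
    """Call a protein an allergen when its score reaches the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical prediction scores and true labels for six test proteins.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
is_allergen = [True, True, False, True, False, False]

# A strict threshold gives few predictions; a lenient one gives many.
strict = precision_recall(predict(scores, 0.75), is_allergen)
lenient = precision_recall(predict(scores, 0.35), is_allergen)

print("strict  (threshold 0.75): precision=%.2f recall=%.2f" % strict)
print("lenient (threshold 0.35): precision=%.2f recall=%.2f" % lenient)
```

With this toy data the strict threshold yields tp = 2, fp = 0 and fn = 1, so every
predicted allergen is a true allergen but one allergen is missed. The lenient threshold
yields tp = 3, fp = 1 and fn = 0, so all allergens are found at the cost of one false
positive.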