method is unlikely to miss any true allergens (a required feature), but is likely to
generate large numbers of false positives that may well overwhelm laboratory
testing capabilities.
5.3.3 Supervised Classification Approaches
Recently, supervised classification approaches have been adopted for allergenicity
prediction (Soeria-Atmadja et al. 2004). The study employed three different supervised
algorithms, namely, the kNN classifier, the Bayesian linear Gaussian classifier, and the
Bayesian quadratic Gaussian classifier. The classifiers were trained on a set of local
alignments produced by FASTA. Each feature vector consists of the alignment length and
score extracted from the best alignment obtained by FASTA. The training data
included both positive (allergen) and negative (non-allergen) datasets.
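As a rough illustration of how such a classifier operates on (alignment length,
alignment score) feature vectors, the following Python sketch implements a plain kNN
majority vote. The training values, the labels, the function name knn_predict, and the
choice of k = 3 with unscaled Euclidean distance are assumptions made for this example;
they are not taken from the study.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of ((alignment_length, alignment_score), label) pairs
    query: (alignment_length, alignment_score)
    """
    # Rank training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data (illustrative values only, not from the study).
TRAIN = [
    ((120, 310.0), "allergen"),
    ((95, 240.0), "allergen"),
    ((110, 280.0), "allergen"),
    ((30, 45.0), "non-allergen"),
    ((42, 60.0), "non-allergen"),
    ((25, 38.0), "non-allergen"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (100, 260.0)))
```

Because the two features are on different scales, a real application would probably
normalize them before computing distances; the sketch omits this for brevity.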
The results of the study indicate that the Bayesian linear Gaussian classifier performed
best, detecting 77% of the allergens with a false positive rate of 10%. It was followed by
the Bayesian quadratic Gaussian classifier (77% of allergens detected with a false
positive rate of 11%) and the kNN classifier (78% of allergens detected with a false
positive rate of 13%). The algorithms may be tuned for either high precision or high
recall. Tuning for high recall would be critical in a screening procedure, where false
negatives (missed allergens) are far less acceptable than false positives. Combining
feature vectors obtained using different scoring matrices improved the Bayesian linear
Gaussian classifier further, allowing it to detect 77% of the allergens with a false
positive rate of 8%.
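The trade-off between detection rate (recall) and false positive rate can be made
concrete with a small Python sketch. It assumes a classifier that outputs a numeric
score per protein and calls a protein an allergen when the score reaches a threshold.
The scores, labels, and function names below are hypothetical and serve only to show
how lowering the threshold raises recall at the cost of more false positives.

```python
def rates(scores, labels, threshold):
    """Return (recall, false_positive_rate) when predicting positive for score >= threshold.

    Assumes `labels` contains at least one True and one False entry.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def threshold_for_recall(scores, labels, target_recall):
    """Return (threshold, recall, fpr) for the highest threshold meeting the target recall."""
    # Try candidate thresholds from strictest to most permissive.
    for t in sorted(set(scores), reverse=True):
        recall, fpr = rates(scores, labels, t)
        if recall >= target_recall:
            return t, recall, fpr
    raise ValueError("target recall not reachable")


# Hypothetical classifier scores; True marks a known allergen.
SCORES = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
LABELS = [True, True, False, True, True, False, True, False, False, False]

if __name__ == "__main__":
    for target in (0.8, 1.0):
        print(target, threshold_for_recall(SCORES, LABELS, target))
```

On this toy data, demanding that every allergen be detected forces a lower threshold
and doubles the false positive rate compared with accepting 80% recall.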
The results obtained look promising, as they allow for much lower false
positive rates than those possible with the FAO/WHO guidelines. However, as
the method relies on local alignments, conformational epitopes may still present
a challenge.
5.3.4 Expectation Maximization
Allergenicity prediction has also been attempted (Stadler and Stadler 2003) using MEME
(Bailey and Elkan 1994), a motif discovery system employing expectation maximization.
The study attempts to locate common motifs among allergens and then to utilize these
motifs for allergenicity prediction. The underlying assumption is that the identified
motifs are indicators of allergenicity.
The method applies MEME iteratively. First, a dataset of 779 non-redundant allergens
was created from public databases. MEME was then applied to this dataset, and the
most significant motif was extracted and converted into a profile. This profile was
used to search the dataset for matching allergens, which were then removed. The
remaining allergens were submitted to the next round of motif discovery and removal.
In total, 52 motifs were discovered, and 644 allergens in the dataset contain one or
more of them. Of the 135 allergens that did not yield any motif, 78 have incomplete
sequence information, which is the main reason for the failures. The remaining 57
allergens are thought to be unique allergens. The 52 discovered motifs can be applied
to any novel protein sequence to determine the significance of the match. Typically,
an E value of