Biomedical Engineering Reference
of the 24 files of exons is broken up into five smaller files, resulting in a total of 120
files. These are then BLASTed against the database of transcripts.
The workflow as constructed in Wildfire is shown in Figure 23.8c. The atomic
component exonx is a program developed in-house to extract exons from a GenBank
file and store them in FASTA format; dice is a Perl script used to break up a FASTA file into
smaller pieces. The noop.sh components at the beginning and end are required
to make sure all the files are in the right place. These will be replaced in future
versions by an implicit merge, which will copy all the relevant files into the input
directory at the beginning and into the results directory at the end. The remaining
components (GNU gunzip, and NCBI BLAST formatdb and blastall) are
standard applications that have been incorporated as atomic components using the
template builder provided.
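The splitting step performed by dice can be sketched as follows. The Perl script itself is not reproduced in the text, so this is only a minimal illustration in Python: the function names and the round-robin distribution of records are assumptions, not a description of how dice actually divides a file. With five pieces per input, 24 input files yield the 120 files mentioned above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def dice(records, pieces):
    """Distribute records round-robin into `pieces` lists (hypothetical policy)."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    return [records[i::pieces] for i in range(pieces)]


def to_fasta(records):
    """Serialize records back to FASTA text, one sequence line per record."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)
```

Round-robin assignment keeps the pieces within one record of each other in count, which is a reasonable default when the pieces are to be searched in parallel; balancing by total sequence length would be an alternative.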
The whole workflow takes less than 6000 s to run on a 128-CPU Pentium III cluster,
whereas a sequential version of the same workflow took almost nine times as long
[41]. The execution profile is shown in Figure 23.8. Further modifications to the
workflow should be able to improve this time.
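To put these timings in perspective, the following sketch computes the implied speedup and parallel efficiency. The text gives only bounds ("less than 6000 s", "almost nine times"), so the inputs below are rounded approximations chosen for illustration, not measured values.

```python
def speedup_and_efficiency(parallel_s, sequential_s, cpus):
    """Return (speedup, efficiency) for a parallel run on `cpus` processors."""
    speedup = sequential_s / parallel_s
    return speedup, speedup / cpus


# Approximate inputs taken from the bounds quoted in the text.
parallel_s = 6000.0            # "less than 6000 s"
sequential_s = 9 * parallel_s  # "almost nine times" as long
speedup, efficiency = speedup_and_efficiency(parallel_s, sequential_s, cpus=128)

print(f"speedup ~ {speedup:.1f}, efficiency ~ {efficiency:.1%}")
```

Under these assumptions, a speedup of about nine on 128 CPUs corresponds to an efficiency of roughly 7%, which is consistent with the remark that the workflow leaves room for improvement.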
23.7.2 Allergenicity Prediction
Allergens are proteins that induce allergic responses. More specifically, they elicit IgE
antibodies and cause the symptoms of allergy, which has been a major health problem
in developed countries [49]. With many transgenic proteins introduced into the food
chain, the need to predict their potential allergenicity has become a crucial issue.
Bioinformatics, and more specifically sequence analysis methods, have an important role
in the identification of allergenicity [25, 29].
One approach to allergenicity prediction is to determine, automatically, motifs
from sequences in an allergenic database and then search for the identified motifs
in the query sequences. Li et al. [40] described an approach where protein sequence
motifs were identified using wavelet analysis [35]. The particular example consists
of 817 sequences in an allergen database. A 10-fold cross-validation test is conducted
where 90% of the sequences are used for motif identification with the remaining 10%
being used as query sequences for validation. This procedure is carried out a number
of times to obtain averaged values for recall and precision. The workflow is shown in
Figure 23.9. A brief description of the workflow follows.
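The 90%/10% partitioning used in the cross-validation test can be sketched as follows. This is a minimal illustration rather than the procedure used in [40]: the function name, the fixed random seed, and the interleaved assignment of sequences to folds are all assumptions.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold
    serves exactly once as the held-out query set.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 817 sequences, as in the allergen database described above.
for train, test in k_fold_splits(817, k=10):
    print(len(train), len(test))
```

Because 817 is not divisible by 10, the held-out sets differ in size by one sequence. Repeating the whole procedure with different seeds gives the repeated runs over which recall and precision are averaged.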
ClustalW is initially used to generate the pair-wise global alignment distances
among the randomly selected protein sequences. The pair-wise distances so obtained
are then used to cluster the protein sequences by partitioning around medoids using
the statistics tool R [3]. Each cluster of protein sequences is subsequently realigned
using ClustalW. The wavelet analysis technique developed by Krishnan et al. is then
used on each aligned cluster to identify motifs in the protein sequences.
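The clustering step operates on a matrix of pair-wise distances rather than on coordinates. The sketch below illustrates the idea with a simple alternating k-medoids heuristic written in Python; it is not the R routine used in the workflow, nor the full partitioning-around-medoids build-and-swap procedure. The naive choice of the first k items as initial medoids, and the assumption that distinct sequences have strictly positive distances, are simplifications made for illustration.

```python
def assign(dist, medoids):
    """Assign every item to its nearest medoid; return {medoid: members}."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix `dist`.

    Alternates between assigning items to the nearest medoid and
    re-selecting, within each cluster, the member with the smallest
    total distance to the other members.
    """
    medoids = list(range(k))  # naive initialization (assumption)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep the medoid of an empty cluster
                continue
            best = min(members, key=lambda c: sum(dist[c][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy example: six items on a line stand in for pair-wise alignment distances.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, clusters = k_medoids(dist, k=2)
print(medoids, clusters)
```

In the workflow, each resulting cluster would then be passed back to ClustalW for realignment.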
HMM profiles [18, 20] are then generated for each identified motif using
hmmbuild. We use these profiles to search for the motifs in each query sequence
using hmmprofile, and thus predict whether it is an allergen. The accuracy of the
predictions is computed to assess the effectiveness of this approach.
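The text does not spell out how accuracy is computed. Assuming the standard definitions of precision and recall, and treating each query sequence as either predicted or not predicted to be an allergen, the calculation can be sketched as follows. The function name and the boolean encoding of predictions are illustrative choices.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for paired boolean predictions and labels.

    precision = TP / (TP + FP): fraction of predicted allergens that are real.
    recall    = TP / (TP + FN): fraction of real allergens that were found.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Five hypothetical query sequences: three predicted positive, four true allergens.
predicted = [True, True, True, False, False]
actual = [True, True, False, True, True]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Averaging these two values over the cross-validation runs gives the summary figures referred to earlier in this section.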