Distributed Workflows in Bioinformatics - Parallel Computing for Bioinformatics and Computational Biology


of the 24 files of exons is broken up into five smaller files, resulting in a total of 120

files. These are then BLASTed against the database of transcripts.

The workflow as constructed in Wildfire is shown in Figure 23.8c. The atomic

component exonx is a program developed in-house to extract exons from a GenBank file

and store them in FASTA format; dice is a Perl script used to break up a FASTA file into

smaller pieces. The noop.sh components at the beginning and end are required

to make sure all the files are in the right place. These will be replaced in future

versions by an implicit merge, which will copy all the relevant files into the input

directory at the beginning and into the results directory at the end. The remaining

components (GNU gunzip, and NCBI BLAST formatdb and blastall) are

standard applications that have been incorporated as atomic components using the

template builder provided.
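The dicing step can be illustrated with a minimal Python sketch. The original dice component is a Perl script whose exact splitting policy is not described here, so the round-robin distribution and the function names `read_fasta` and `dice_records` below are assumptions made for illustration only.

```python
from typing import List


def read_fasta(text: str) -> List[str]:
    """Split FASTA text into records, each starting with a '>' header line."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def dice_records(records: List[str], pieces: int) -> List[List[str]]:
    """Distribute records round-robin into `pieces` chunks (assumed policy)."""
    chunks: List[List[str]] = [[] for _ in range(pieces)]
    for i, record in enumerate(records):
        chunks[i % pieces].append(record)
    return chunks


example = ">e1\nACGT\n>e2\nGGCC\nTTAA\n>e3\nAAAA\n"
chunks = dice_records(read_fasta(example), 2)
for number, chunk in enumerate(chunks):
    print(f"piece {number}: {len(chunk)} record(s)")
```

Applied with `pieces=5` to each of the 24 exon files, a splitter of this kind yields the 120 files that are then BLASTed independently.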

The whole workflow takes less than 6000 s to run on a 128-CPU Pentium III cluster,

whereas a sequential version of the same workflow took almost nine times as long

[41]. The execution profile is shown in Figure 23.8. Further modifications to the

workflow should be able to improve this time.
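A rough calculation shows why there is room for improvement. The sketch below treats the two reported figures ("less than 6000 s" and "almost nine times") as upper bounds, so the results are estimates rather than measured values.

```python
# Figures taken from the text; both are upper bounds, so the results are estimates.
parallel_seconds = 6000   # "less than 6000 s" on the cluster
slowdown_factor = 9       # sequential run took "almost nine times" as long
cpus = 128

sequential_seconds = parallel_seconds * slowdown_factor
speedup = sequential_seconds / parallel_seconds
efficiency = speedup / cpus  # fraction of ideal linear speedup achieved

print(f"sequential time: about {sequential_seconds} s")
print(f"speedup: about {speedup:.1f}x on {cpus} CPUs")
print(f"parallel efficiency: about {efficiency:.1%}")
```

A speedup of roughly 9 on 128 CPUs corresponds to a parallel efficiency of about 7%, which suggests that much of the cluster is idle for a large part of the run.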

23.7.2 Allergenicity Prediction

Allergens are proteins that induce allergic responses. More specifically, they elicit IgE

antibodies and cause the symptoms of allergy, which has been a major health problem

in developed countries [49]. With many transgenic proteins introduced into the food

chain, the need to predict their potential allergenicity has become a crucial issue.

Bioinformatics, and more specifically sequence analysis methods, have an important role

in the identification of allergenicity [25, 29].

One approach to allergenicity prediction is to determine, automatically, motifs

from sequences in an allergenic database and then search for the identified motifs

in the query sequences. Li et al. [40] described an approach where protein sequence

motifs were identified using wavelet analysis [35]. The particular example consists

of 817 sequences in an allergen database. A 10-fold cross-validation test is conducted

where 90% of the sequences are used for motif identification with the remaining 10%

being used as query sequences for validation. This procedure is carried out a number

of times to obtain averaged values for recall and precision. The workflow is shown in

Figure 23.9. A brief description of the workflow follows.
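The validation scheme can be sketched in Python as follows. This is only an illustration of a 10-fold split and of the precision and recall calculations; the function names, the random seed, and the toy labels are assumptions and are not taken from the original workflow.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(n_items: int, folds: int = 10,
                  seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::folds] for i in range(folds)]
    splits = []
    for k in range(folds):
        test = parts[k]
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        splits.append((train, test))
    return splits


def precision_recall(predicted: Sequence[bool],
                     actual: Sequence[bool]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


splits = k_fold_splits(817)  # 817 allergen sequences, as in the text
train, test = splits[0]
print(len(train), len(test))

# Toy labels, purely illustrative.
precision, recall = precision_recall([True, True, False, True],
                                     [True, False, True, True])
print(precision, recall)
```

Because 817 is not divisible by 10, the folds hold 81 or 82 sequences each, so roughly 90% of the sequences are available for motif identification in every round.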

ClustalW is initially used to generate the pair-wise global alignment distances

among the randomly selected protein sequences. The pair-wise distances so obtained

are then used to cluster the protein sequences by partitioning around medoids using

the statistics tool R [3]. Each cluster of protein sequences is subsequently realigned

using ClustalW. The wavelet analysis technique developed by Krishnan et al. is then

used on each aligned cluster to identify motifs in the protein sequences.
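The clustering step works directly on a pair-wise distance matrix. The sketch below is a simplified k-medoids procedure that alternates assignment and medoid update; it is not the swap-based partitioning-around-medoids algorithm implemented in R, and the naive initialisation and the toy distances are assumptions chosen only to keep the example short.

```python
from typing import List, Tuple


def k_medoids(dist: List[List[float]], k: int,
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Simplified k-medoids on a precomputed distance matrix.

    Returns (medoid indices, cluster label per item).
    """
    n = len(dist)
    medoids = list(range(k))  # naive initialisation: the first k items
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment: each item joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update: the new medoid minimises total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            best = min(members,
                       key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Toy distances standing in for pair-wise alignment distances.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

In the workflow, the members of each resulting cluster would then be realigned and passed on to motif identification.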

HMM profiles [18, 20] are then generated for each identified motif using

hmmbuild. We use these profiles to search for the motifs in each query sequence

using hmmprofile, and thus predict whether it is an allergen. The accuracy of the

predictions is computed to assess the effectiveness of this approach.
