accomplish this. Some of these are commercial, such as Sequest [5] and Mascot
[6], and some are open source or public domain, such as X!Tandem [7] or
OMSSA [8].
Peptide identification from mass spectrometry data is amenable to cloud
computing in that the data set consists of tens of thousands of individual
fragmentation spectra and the peptide identification process is more or less
independent from spectrum to spectrum. This allows the use of a MapReduce-like
strategy in which worker nodes can be assigned packets of spectra to search,
returning their results to a common area for integration when all the searches
are completed. This works well because the majority of the computational effort
is expended in the individual searches rather than in splitting the data or
combining the results.
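The strategy can be sketched as a map step over packets of spectra followed by a reduce step that merges the per-packet results. The sketch below is illustrative only: search_packet is a stand-in for whatever search engine (e.g., X!Tandem or OMSSA) is actually invoked, and the packet size is an arbitrary choice.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

PACKET_SIZE = 200  # arbitrary choice; practical sizes depend on search cost

def make_packets(spectra, size=PACKET_SIZE):
    """Split the full set of fragmentation spectra into independent packets."""
    it = iter(spectra)
    while packet := list(islice(it, size)):
        yield packet

def search_packet(packet):
    """Stand-in for a peptide search over one packet of spectra.

    In practice this step would invoke an engine such as X!Tandem or OMSSA
    and return its peptide-spectrum matches for the packet.
    """
    return [("peptide-match-for", spectrum_id) for spectrum_id in packet]

def run_search(spectra):
    """Map: search packets in parallel.  Reduce: concatenate all matches."""
    with ProcessPoolExecutor() as pool:
        per_packet = pool.map(search_packet, make_packets(spectra))
    return [match for matches in per_packet for match in matches]

if __name__ == "__main__":
    # Toy input: integer identifiers standing in for real fragmentation spectra.
    print(len(run_search(range(1000))))
```

Because almost all of the work happens inside search_packet, the cost of splitting the input and concatenating the outputs is negligible, which is what makes this embarrassingly parallel division of labor effective.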
To allow for high-throughput analysis of proteomics data, we have developed
the Virtual Proteomics Data Analysis Cluster (ViPDAC) system. ViPDAC is based
on the AWS EC2 and S3 systems and relies on open-source algorithms and
programs for peptide identification, together with open-source software
developed for ViPDAC to distribute spectra, manage worker nodes, and summarize
the results. ViPDAC is available as a public AMI that can be launched by
anyone with an AWS account. The ViPDAC AMI includes an integrated Web server
so that interactions between the end user and the ViPDAC head node occur
through a familiar Web interface. Through this interface, the end user can
choose data sets and analysis parameters and add or terminate worker nodes.
Raw data are first uploaded to the end user's S3 storage area, and results are
returned to the user's S3 or through a download link on the website.
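For orientation, the S3 side of this workflow can be sketched with the AWS SDK; the bucket and object names below are hypothetical examples, and in practice ViPDAC's web front end manages these transfers for the user.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key names; ViPDAC manages the actual paths.
BUCKET = "my-vipdac-data"

# Upload raw spectra to the user's S3 storage area before starting an analysis.
s3.upload_file("experiment1.mgf", BUCKET, "raw/experiment1.mgf")

# After the analysis completes, retrieve the summarized results from S3.
s3.download_file(BUCKET, "results/experiment1_summary.csv",
                 "experiment1_summary.csv")
```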
Since ViPDAC was developed before the MapReduce function of AWS was available,
it uses its own facilities to distribute spectra to the worker nodes, manage
the nodes, and collect the results. When the end user launches the initial
ViPDAC instance, that instance configures itself as the head node and controls
the distribution and retrieval of data. When subsequent instances are launched
by the same user, they recognize that the head node exists and configure
themselves as worker nodes. Worker nodes then make requests to the head node
for packets of spectra to search. The head node responds with a message
informing the worker node of the location of the compressed file containing
the spectra, the search parameters, and the database to use for the search.
When the searches are complete, the worker informs the head node that the
packet has been completed and the data are collected.
If the head node does not receive a message that the searches are complete
within the specified time, it considers that the worker node has failed and
returns the packet of spectra to the queue for analysis by a different node.
One issue with this system is that the time a search requires can vary greatly
with the complexity of the spectra, the size of the database, and the
parameters chosen. For this reason, it is important that the end user not
choose a timeout value that is too short to complete a given set of spectra.
The required timeout can also be managed by adjusting the size of the packets
of spectra distributed to the worker nodes.
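The head node's packet distribution and timeout handling could be sketched roughly as follows; the class, field names, and the fixed default timeout are assumptions made for illustration, not ViPDAC's actual implementation.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Packet:
    """One unit of work: a compressed file of spectra plus search settings."""
    spectra_file: str      # location of the compressed spectra
    parameters: dict       # search parameters chosen by the end user
    database: str          # sequence database to search against
    assigned_at: float = 0.0

class HeadNodeQueue:
    """Minimal sketch of the head node's packet queue with a worker timeout."""

    def __init__(self, packets, timeout_seconds=3600):
        self.pending = deque(packets)   # packets waiting for a worker
        self.in_progress = {}           # packet id -> Packet being searched
        self.completed = []             # collected results
        self.timeout = timeout_seconds  # assumed default; user-configurable

    def request_packet(self):
        """Called by a worker asking for work; returns a packet or None."""
        self._requeue_timed_out()
        if not self.pending:
            return None
        packet = self.pending.popleft()
        packet.assigned_at = time.time()
        self.in_progress[id(packet)] = packet
        return packet

    def report_complete(self, packet, results):
        """Called by a worker when its searches finish; collect the results."""
        self.in_progress.pop(id(packet), None)
        self.completed.append(results)

    def _requeue_timed_out(self):
        """Treat overdue workers as failed and return their packets to the queue."""
        now = time.time()
        for key, packet in list(self.in_progress.items()):
            if now - packet.assigned_at > self.timeout:
                del self.in_progress[key]
                self.pending.append(packet)
```

In this sketch, a timeout that is too short would cause slow but healthy workers to have their packets reassigned, duplicating work; making packets smaller shortens the worst-case search time and lets a tighter timeout be used safely.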