accomplish this. Some of these are commercial, such as Sequest [5] and Mascot
[6], and some are open source or public domain, such as X!Tandem [7] or
OMSSA [8].
Peptide identification from mass spectrometry data is amenable to cloud
computing in that the data set consists of tens of thousands of individual
fragmentation spectra and the peptide identification process is more or less
independent from spectrum to spectrum. This allows the use of a MapReduce-like
strategy in which worker nodes can be assigned packets of spectra to search,
returning their results to a common area for integration when all the searches
are completed. This works well because the majority of the computational effort
is expended in the individual searches rather than in splitting the data or
combining the results.
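The strategy can be sketched as a map step over packets of spectra followed by a reduce step that merges the per-packet results. The sketch below is illustrative only: search_packet is a stand-in for whatever search engine (e.g., X!Tandem or OMSSA) is actually invoked, and the packet size is an arbitrary choice.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

PACKET_SIZE = 200  # arbitrary choice; practical sizes depend on search cost

def make_packets(spectra, size=PACKET_SIZE):
    """Split the full set of fragmentation spectra into independent packets."""
    it = iter(spectra)
    while packet := list(islice(it, size)):
        yield packet

def search_packet(packet):
    """Stand-in for a peptide search over one packet of spectra.

    In practice this step would invoke an engine such as X!Tandem or OMSSA
    and return its peptide-spectrum matches for the packet.
    """
    return [("peptide-match-for", spectrum_id) for spectrum_id in packet]

def run_search(spectra):
    """Map: search packets in parallel.  Reduce: concatenate all matches."""
    with ProcessPoolExecutor() as pool:
        per_packet = pool.map(search_packet, make_packets(spectra))
    return [match for matches in per_packet for match in matches]

if __name__ == "__main__":
    # Toy input: integer identifiers standing in for real fragmentation spectra.
    print(len(run_search(range(1000))))
```

Because almost all of the work happens inside search_packet, the cost of splitting the input and concatenating the outputs is negligible, which is what makes this embarrassingly parallel division of labor effective.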
To allow for high-throughput analysis of proteomics data, we have developed
the Virtual Proteomics Data Analysis Cluster (ViPDAC) system. ViPDAC is based
on the AWS EC2 and S3 systems and relies on open-source algorithms and
programs for peptide identification, together with open-source software
developed for ViPDAC to distribute spectra, manage worker nodes, and summarize
the results. ViPDAC is available as a public AMI that can be launched by
anyone with an AWS account. The ViPDAC AMI includes an integrated Web server
so that interactions between the end user and the ViPDAC head node occur
through a familiar Web interface. Through this interface, the end user can
choose data sets and analysis parameters and add or terminate worker nodes.
Raw data are first uploaded to the end user's S3 storage area, and results are
returned to the user's S3 or through a download link on the website.
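For orientation, the S3 side of this workflow can be sketched with the AWS SDK; the bucket and object names below are hypothetical examples, and in practice ViPDAC's web front end manages these transfers for the user.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key names; ViPDAC manages the actual paths.
BUCKET = "my-vipdac-data"

# Upload raw spectra to the user's S3 storage area before starting an analysis.
s3.upload_file("experiment1.mgf", BUCKET, "raw/experiment1.mgf")

# After the analysis completes, retrieve the summarized results from S3.
s3.download_file(BUCKET, "results/experiment1_summary.csv",
                 "experiment1_summary.csv")
```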
Since ViPDAC was developed before the MapReduce function of AWS was available,
it uses its own facilities to distribute spectra to the worker nodes, manage
the nodes, and collect the results. When the end user launches the initial
ViPDAC instance, that instance configures itself as the head node and controls
the distribution and retrieval of data. When subsequent instances are launched
by the same user, they recognize that the head node exists and configure
themselves as worker nodes. Worker nodes then make requests to the head node
for packets of spectra to search. The head node responds with a message
informing the worker node of the location of the compressed file containing
the spectra, the search parameters, and the database to use for the search.
When the searches are complete, the worker informs the head node that the
packet has been completed and the data are collected.
If the head node does not receive a message that the searches are complete
within the specified time, it considers that the worker node has failed and
returns the packet of spectra to the queue for analysis by a different node.
One issue with this system is that the time a search requires can vary greatly
with the complexity of the spectra, the size of the database, and the
parameters chosen. For this reason, it is important that the end user not
choose a timeout value that is too short to complete a given set of spectra.
The required timeout can also be managed by adjusting the size of the packets
of spectra distributed to the worker nodes.
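The head node's packet distribution and timeout handling could be sketched roughly as follows; the class, field names, and the fixed default timeout are assumptions made for illustration, not ViPDAC's actual implementation.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Packet:
    """One unit of work: a compressed file of spectra plus search settings."""
    spectra_file: str      # location of the compressed spectra
    parameters: dict       # search parameters chosen by the end user
    database: str          # sequence database to search against
    assigned_at: float = 0.0

class HeadNodeQueue:
    """Minimal sketch of the head node's packet queue with a worker timeout."""

    def __init__(self, packets, timeout_seconds=3600):
        self.pending = deque(packets)   # packets waiting for a worker
        self.in_progress = {}           # packet id -> Packet being searched
        self.completed = []             # collected results
        self.timeout = timeout_seconds  # assumed default; user-configurable

    def request_packet(self):
        """Called by a worker asking for work; returns a packet or None."""
        self._requeue_timed_out()
        if not self.pending:
            return None
        packet = self.pending.popleft()
        packet.assigned_at = time.time()
        self.in_progress[id(packet)] = packet
        return packet

    def report_complete(self, packet, results):
        """Called by a worker when its searches finish; collect the results."""
        self.in_progress.pop(id(packet), None)
        self.completed.append(results)

    def _requeue_timed_out(self):
        """Treat overdue workers as failed and return their packets to the queue."""
        now = time.time()
        for key, packet in list(self.in_progress.items()):
            if now - packet.assigned_at > self.timeout:
                del self.in_progress[key]
                self.pending.append(packet)
```

In this sketch, a timeout that is too short would cause slow but healthy workers to have their packets reassigned, duplicating work; making packets smaller shortens the worst-case search time and lets a tighter timeout be used safely.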