requesting the stored image and the required number and type of resources.
This deployment procedure was carried out for EXP-PAC.
To set up the EXP-PAC cloud image, a complex deployment procedure is
carried out (see Figure 11.8). First, an Ubuntu server AMI is selected from
the Amazon EC2 web interface and launched. Second, as this image is not
from a trusted source, steps must be taken to ensure the image has not
been compromised. Antivirus scans are performed, and the Ubuntu image
is updated to ensure there are no vulnerabilities. Next, using the Ubuntu
software repository, LAMP is installed; this software stack contains the
principal components (Apache, PHP, and MySQL) of a viable general-
purpose web server. PHP and Apache are configured, increasing the POST
and upload data limits to support large data uploads and analysis. EXP-PAC is
then placed into the web server directory and configured to use the MySQL
database. To enable the HPC features of EXP-PAC, Open MPI and Bioconductor
are also deployed on this server. The Amazon cloud image is then stored
in its modified form for future use.
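The setup steps above can be sketched as a provisioning script. This is an illustrative sketch, not the authors' actual script: the package names follow the Ubuntu repositories of that era, and the PHP limit values, php.ini path, and EXP-PAC install path are assumptions.

```shell
#!/bin/sh
# Provisioning sketch for the EXP-PAC image (paths and limits are assumptions).
set -e

# Update the base Ubuntu image to close known vulnerabilities.
sudo apt-get update && sudo apt-get -y upgrade

# Install the LAMP stack (Apache, MySQL, PHP) plus Open MPI and R.
sudo apt-get -y install lamp-server^ openmpi-bin libopenmpi-dev r-base

# Raise PHP's POST and upload limits to support large data uploads
# (example values; the php.ini path varies by PHP version).
sudo sed -i 's/^post_max_size.*/post_max_size = 2G/' /etc/php5/apache2/php.ini
sudo sed -i 's/^upload_max_filesize.*/upload_max_filesize = 2G/' /etc/php5/apache2/php.ini

# Install Bioconductor inside R (hypothetical one-liner for that era's installer).
sudo Rscript -e 'source("http://bioconductor.org/biocLite.R"); biocLite()'

# Place EXP-PAC in the web server directory and restart Apache.
sudo cp -r exp-pac/ /var/www/
sudo service apache2 restart
```

After these steps the running instance is saved as a new AMI, so later deployments skip the manual configuration.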
Publication of the EXP-PAC virtual machine image to Uncinus was
performed through a web interface (see Figure 11.9). The virtual machine
publication interface allows users to specify information about the published cloud
image that is used during deployment. The attributes required to publish a
virtual machine image are the AMI ID of the cloud image, a description of
the published cloud image, the supported instance types of the image, log-in
information, the home directory, and the OS utilized by the cloud image.
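As a concrete illustration, the published attributes might look like the following fragment. All values and field names are hypothetical paraphrases of the list above, not Uncinus's actual schema.

```shell
# Hypothetical publication record for the EXP-PAC image (illustrative values only).
AMI_ID="ami-0123abcd"                          # AMI ID of the stored cloud image
DESCRIPTION="EXP-PAC gene expression analysis server"
INSTANCE_TYPES="cc1.4xlarge"                   # supported instance types
LOGIN_USER="ubuntu"                            # log-in information
HOME_DIR="/home/ubuntu"                        # home directory
OS="Ubuntu Server"                             # OS utilized by the cloud image
```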
11.5.2 Workflow Execution
Once software has been deployed on the cloud, users can execute exposed
applications through published interfaces. To utilize the HPC normalization
methods provided by EXP-PAC, this case study was run on four cluster
compute instances (64-bit, dual-quad core; 23 GB RAM).
Breast cancer tumor RNAseq data (GSM721140) was downloaded from
the National Center for Biotechnology Information (NCBI). These data con-
tained 44.8 million sequence fragments, which were mapped (aligned) to the
human reference genome. Before the data could be analyzed, a number of preprocessing steps
were carried out on the data. First, SAMtools (Li et al. 2009) was used to
convert the downloaded data to a human-readable format. The converted
data were imported into HTSeq (Anders 2010), run in union mode, non-
stranded, which sorted and counted the sequence fragments that matched
known genes. The output of HTSeq was a list of genes and the number of
times each appeared in the tumor.
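The preprocessing steps can be sketched as a short pipeline. The file names are hypothetical placeholders; the HTSeq options mirror the union-mode, non-stranded settings described above.

```shell
#!/bin/sh
# Preprocessing sketch (file names are hypothetical placeholders).

# Convert the downloaded alignment to human-readable SAM with SAMtools.
samtools view -h GSM721140.bam > GSM721140.sam

# Sort and count fragments per known gene with HTSeq
# (union mode, non-stranded), writing per-gene counts to a file.
htseq-count -m union -s no GSM721140.sam genes.gtf > gene_counts.txt
```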
In addition to the list of expressed genes, it was necessary to identify the
number of mutations that had occurred in each gene. A mutation score was
given to each sequence by counting the bases that differed from the reference
genome. This process resulted in the creation of two data sets, a count of present
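A minimal sketch of the mutation score, assuming the read and the reference segment are already aligned base-for-base (both sequences here are made up for illustration):

```shell
# Count the bases in an aligned read that differ from the reference segment.
read_seq="ACGTACGT"   # hypothetical aligned read
ref_seq="ACGAACGA"    # hypothetical reference genome segment
score=$(awk -v a="$read_seq" -v b="$ref_seq" 'BEGIN {
  n = 0
  for (i = 1; i <= length(a); i++)
    if (substr(a, i, 1) != substr(b, i, 1)) n++
  print n
}')
echo "$score"   # two positions differ, so the score is 2
```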