COLLABORATIVE-BASED BIOINFORMATICS APPLICATIONS - Collaborative Computational Technologies for Biomedical Research

added to EC2. Elastic Block Store (EBS) is a persistent data store that can be

attached to an instance and read from and written to much like an external

hard drive on a physical computer, which makes data transfer easier than with S3. Like

S3, EC2 instances can be located in different geographical locations to decrease

network latency. When running, instances can be accessed by using a network

address provided when the instance is launched. Additionally, an IP address

can be assigned to an instance and an instance can include a Web server to

allow for Web access to programs running on the instance. For additional

security, a virtual private cloud using virtual private network (VPN) technology

can be created. This allows institutions to connect to the AWS cloud as

though it were part of the institution's network. In order to monitor and control

running instances, Amazon offers the CloudWatch monitoring service, Elastic

Load Balancing, and Auto Scaling.

An important computational resource for bioinformatics is the Amazon

Elastic MapReduce service. Elastic MapReduce is built on the Hadoop framework.

Hadoop is a system that creates a compute cluster from a collection of virtual

instances. It supports data-intensive distributed applications by creating a

distributed file system that allows individual nodes to share data, together with job

tracker and task tracker functions that oversee the analysis of the data by the

individual instances. The MapReduce service takes problems that can be

broken down into smaller elements and automates their analysis. These so-called

embarrassingly parallel problems are characterized by having data elements

that can be analyzed independently from the entire data set. A good example

from proteomics is peptide identification from mass spectra. A mass spectrometry

run can be broken down into individual spectra. Each of the spectra

can be compared to the peptide sequence database to find the best match in

the database, and the results from the individual searches can be combined to

produce the final search results. MapReduce automates the splitting of the

data and the analysis of each piece (the “map” step), the establishment and oversight of the worker

Hadoop cluster instances, and the combination of the results produced by the

individual workers (the “reduce” step).
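
The peptide identification example can be sketched in a few lines of plain Python. This is only a local illustration of the split, map, and reduce pattern, not the Hadoop or Elastic MapReduce API. The names `PEPTIDE_DB`, `split_run`, `map_chunk`, `reduce_pairs`, and `identify_peptides` are hypothetical, the masses are made-up values, and matching a spectrum by nearest mass is a deliberate simplification of a real database search.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy representation (assumption): a spectrum is (spectrum id, observed mass).
Spectrum = Tuple[str, float]

# Hypothetical peptide database: sequence -> mass. Values are invented for illustration.
PEPTIDE_DB: Dict[str, float] = {"AAAK": 100.0, "GGGR": 200.0, "LLLK": 300.0}


def split_run(run: List[Spectrum], n_chunks: int) -> List[List[Spectrum]]:
    """Split one run into n_chunks pieces that can be analyzed independently."""
    return [run[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk: List[Spectrum]) -> List[Tuple[str, str]]:
    """Map step: for each spectrum, emit (best-matching peptide, spectrum id)."""
    pairs = []
    for spectrum_id, mass in chunk:
        # Simplified scoring: the best match is the peptide with the closest mass.
        best = min(PEPTIDE_DB, key=lambda pep: abs(PEPTIDE_DB[pep] - mass))
        pairs.append((best, spectrum_id))
    return pairs


def reduce_pairs(mapped: List[List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Reduce step: combine per-chunk results, grouping spectrum ids by peptide."""
    combined: Dict[str, List[str]] = defaultdict(list)
    for pairs in mapped:
        for peptide, spectrum_id in pairs:
            combined[peptide].append(spectrum_id)
    return {peptide: sorted(ids) for peptide, ids in combined.items()}


def identify_peptides(run: List[Spectrum], n_chunks: int = 2) -> Dict[str, List[str]]:
    """Run the whole pipeline locally; on a cluster each chunk would go to a worker."""
    chunks = split_run(run, n_chunks)
    mapped = [map_chunk(chunk) for chunk in chunks]  # each call is independent of the others
    return reduce_pairs(mapped)
```

Because no spectrum depends on any other, the combined result is the same however the run is split, which is what makes the problem embarrassingly parallel.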

There are other AWS services available that can be used in concert with

EC2 and S3. These include message management services such as Amazon

Simple Queue Service (Amazon SQS), which allows instances to exchange

messages and coordinate the parallel analysis of data, and Amazon Simple

Notification Service (Amazon SNS), which allows running instances to send

messages to other instances, servers, or end users that subscribe to the

messages from the instances. This allows workflows composed of AWS instances

to respond to events. Additionally, AWS offers two database services. Amazon

SimpleDB is a simple nonrelational database that provides easy access to data

with a high degree of availability and scalability. For more demanding needs,

AWS also offers a relational database service, Amazon Relational Database

Service (Amazon RDS), which provides a cloud-based relational database

equivalent to MySQL and is compatible with applications that use MySQL.
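
The queue and notification pattern described above for Amazon SQS and Amazon SNS can be illustrated locally with Python's standard library. The sketch below does not call any AWS API: `queue.Queue` stands in for a message queue shared by workers, and the hypothetical `Topic` class stands in for a notification topic with subscribers. The `worker` and `run_workflow` names, and the squaring step used as the "analysis", are placeholders chosen for illustration.

```python
import queue
import threading
from typing import Callable, List


class Topic:
    """Local stand-in for a notification topic: every subscriber gets each message."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, message: str) -> None:
        for callback in self._subscribers:
            callback(message)


def worker(tasks: "queue.Queue[int]", results: "queue.Queue[int]") -> None:
    """Pull tasks from the shared queue until it is empty."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        results.put(item * item)  # placeholder for a real analysis step


def run_workflow(items: List[int], n_workers: int, topic: Topic) -> List[int]:
    """Distribute items to workers through a queue, then notify subscribers."""
    tasks: "queue.Queue[int]" = queue.Queue()
    results: "queue.Queue[int]" = queue.Queue()
    for item in items:
        tasks.put(item)

    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Every task produced exactly one result, so this drains the results queue.
    output = sorted(results.get() for _ in range(len(items)))
    topic.publish(f"workflow finished: {len(output)} results")
    return output
```

A subscriber can be any callable, for example `topic.subscribe(print)`, so that the end of the workflow triggers whatever step follows it.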
