COLLABORATIVE - BASED BIOINFORMATICS APPLICATIONS - Collaborative Computational Technologies for Biomedical Research

spectra packet. For particularly difficult searches, such as those with no peptide specificity (unconstrained searches), it is advisable to use a spectra packet size of one spectrum.
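The effect of the packet size can be illustrated with a minimal sketch. It assumes that a spectra packet is simply a fixed-size batch of spectra handed to one worker as a unit; the function name `make_packets` and the scan labels are hypothetical and are not taken from the software discussed here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_packets(spectra: Sequence[T], packet_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `packet_size` spectra."""
    if packet_size < 1:
        raise ValueError("packet_size must be at least 1")
    for start in range(0, len(spectra), packet_size):
        yield list(spectra[start:start + packet_size])


# Hypothetical scan identifiers standing in for real spectra.
spectra = ["scan_1", "scan_2", "scan_3", "scan_4", "scan_5"]

# A routine search might batch several spectra per packet ...
routine = list(make_packets(spectra, packet_size=2))

# ... while an unconstrained search uses one spectrum per packet,
# so each slow search becomes its own unit of work.
unconstrained = list(make_packets(spectra, packet_size=1))
```

With a packet size of one, the number of work units equals the number of spectra, which spreads a slow search as evenly as possible across the available workers.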

23.3.2 Other Bioinformatics Tools

The J. Craig Venter Institute (JCVI) has produced an AMI, titled JCVI Cloud Bio-Linux, that is preconfigured with many of the standard bioinformatics tools. The instance is based on 64-bit Ubuntu Linux and contains the Celera Assembler [9], the European Molecular Biology Open Software Suite [10], BLAST [11], ClustalW [12], Glimmer [13], GeneSpring [14], HMMER [15], PHYLIP [16], and RasMol [17]. The goal of the project is to produce a platform that groups can use to set up and distribute bioinformatics analysis systems and data, in the hope of overcoming the difficulties of installing and configuring bioinformatics tools.

23.3.3 Next-Generation DNA Sequencing

One of the most significant challenges of bioinformatics is the analysis of the huge volume of data generated by next-generation DNA sequencing efforts [18]. This process produces millions of short sequence reads, which must be aligned and merged to produce the final sequence. As the rate of sequencing has accelerated, the data storage requirements have moved from megabytes to gigabytes to terabytes and soon to petabytes. The computational time needed to process these data has similarly increased. To address this, systems using cloud computing have been developed. One of the uses of next-generation sequencing is the mapping of genomes and the identification of single-nucleotide polymorphisms (SNPs). The CloudBurst application (described below) uses AWS MapReduce and Hadoop to generate a cluster of computers to process the alignment of reads from next-generation sequencing instruments [18]. The algorithm is based on aligning reads to a reference genome and then extending the alignment by adding additional reads. This is expedited by the hosting of Ensembl and GenBank genomic data in S3, which makes the required reference genome data available with low latency and no cost for transfer and storage.
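How read alignment can be cast as a MapReduce job is easier to see in a toy, single-process sketch. The version below uses a seed-and-extend scheme, which is an assumption made for illustration and is not spelled out in the text: the map step emits fixed-length seeds from both the reference and the reads, the shuffle step groups records that share a seed, and the reduce step extends each seed hit to the full read and counts mismatches. All function names, the seed length `K`, and the sequences are hypothetical; this is not CloudBurst's actual code.

```python
from collections import defaultdict

K = 4  # seed length (hypothetical value chosen for this toy example)


def map_reference(ref):
    """Map step: emit (seed, ("ref", position)) for every k-mer in the reference."""
    for pos in range(len(ref) - K + 1):
        yield ref[pos:pos + K], ("ref", pos)


def map_read(read_id, read):
    """Map step: emit (seed, (read_id, offset)) for non-overlapping k-mers of a read."""
    for off in range(0, len(read) - K + 1, K):
        yield read[off:off + K], (read_id, off)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their seed key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, ref, reads, max_mismatches):
    """Reduce step: extend each shared seed to a full-length alignment."""
    ref_positions = [pos for src, pos in values if src == "ref"]
    read_hits = [(src, off) for src, off in values if src != "ref"]
    for read_id, off in read_hits:
        read = reads[read_id]
        for pos in ref_positions:
            start = pos - off  # where the whole read would begin on the reference
            if start < 0 or start + len(read) > len(ref):
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield read_id, start, mismatches


def align(ref, reads, max_mismatches=1):
    """Run the map, shuffle, and reduce steps in one process."""
    pairs = list(map_reference(ref))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read))
    hits = set()  # a read can be found through several seeds; keep each hit once
    for values in shuffle(pairs).values():
        hits.update(reduce_seed(values, ref, reads, max_mismatches))
    return sorted(hits)


reference = "ACGTTGCAAGGCTTAC"
reads = {"r1": "TTGCAAGG", "r2": "AGGCTAAC"}
alignments = align(reference, reads, max_mismatches=1)
print(alignments)  # [('r1', 3, 0), ('r2', 8, 1)]
```

Each 8-base read is cut into two non-overlapping seeds, so a read with at most one mismatch must contain at least one exact seed. In a real Hadoop job, each reduce call would run on a different node, which is what lets the work spread across a cluster.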

The Crossbow system for DNA sequence alignment and SNP discovery, developed at Johns Hopkins University, uses cloud computing to align high-throughput DNA sequencing reads and find individual polymorphisms [19]. It combines Bowtie [20] to align short reads and SOAPsnp [21] to call genotypes. It is based on MapReduce and uses Hadoop to parallelize the computational load across multiple AWS instances. According to the developers, it can analyze more than 35-fold coverage of a human genome in 3 hours for about $85 using a 40-node, 320-core cluster rented from Amazon Web Services.
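The quoted figures imply a rough unit cost, worked through below. The inputs are the approximate values reported by the developers; the derived rates are implied averages only, since the total may include charges other than instance time.

```python
# Figures quoted in the text (approximate, as reported by the developers).
nodes = 40
cores = 320
hours = 3
total_cost_usd = 85.0

# Derived quantities.
cores_per_node = cores // nodes                   # 8 cores on each node
node_hours = nodes * hours                        # 120 node-hours
core_hours = cores * hours                        # 960 core-hours
cost_per_node_hour = total_cost_usd / node_hours  # about $0.71
cost_per_core_hour = total_cost_usd / core_hours  # about $0.089

print(f"{cores_per_node} cores per node")
print(f"{node_hours} node-hours, {core_hours} core-hours")
print(f"${cost_per_node_hour:.2f} per node-hour, ${cost_per_core_hour:.3f} per core-hour")
```

Under these assumptions the analysis consumes 960 core-hours in total, at an implied cost of a little under nine cents per core-hour.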

A similar program, also developed at Johns Hopkins University, is Myrna [22]. Myrna also uses Bowtie and Hadoop, but rather than assemble entire genomes, it measures gene expression by analyzing RNA-seq data sets. Like
