Large-Scale Co-Phylogenetic Analysis on the Grid - Cloud, Grid and High Performance Computing: Emerging Applications

of bioinformatics, biological insight is typically generated via data analysis pipelines that use a plethora of distinct and highly specialized tools. Most commonly, bioinformaticians and biologists collaborate to analyze data extracted from large databases containing DNA and/or protein data in order to study, e.g., the function of living beings, the effect and influence of diseases and defects, or their evolutionary history. Early “classic” bioinformatics tools that have been ported to Grid computing environments, such as CLUSTALW (Thompson et al., 1994) or BLAST (Altschul et al., 1997), deal with biological sequence search, analysis, and comparison. Typically, these programs are embarrassingly parallel and therefore represent ideal candidate applications for Grid computing environments (Stockinger et al., 2006).
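
To illustrate what "embarrassingly parallel" means for sequence search, the following minimal Python sketch scores each query sequence against a small database independently of all other queries. It is an illustration under stated assumptions, not the BLAST or CLUSTALW algorithm: the shared k-mer score, the names `shared_kmers`, `best_hit`, and `search_all`, and the example sequences are all hypothetical. A local thread pool stands in for the Grid, where each task would instead be submitted as a separate job.

```python
from concurrent.futures import ThreadPoolExecutor


def shared_kmers(query, subject, k=3):
    """Toy similarity score (an assumption, NOT the BLAST/CLUSTALW scoring):
    the number of distinct k-mers that occur in both sequences."""
    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers(query) & kmers(subject))


def best_hit(query, database):
    """Return (name, score) of the database entry most similar to the query."""
    name = max(database, key=lambda n: shared_kmers(query, database[n]))
    return name, shared_kmers(query, database[name])


def search_all(queries, database, workers=4):
    """Process every query as an independent task.

    No task reads or writes data belonging to another task, which is
    what makes the workload embarrassingly parallel.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input queries
        return list(pool.map(lambda q: best_hit(q, database), queries))


# Hypothetical example data
DATABASE = {"seqA": "ACGTACGTGA", "seqB": "TTGACCATGC"}
QUERIES = ["ACGTAC", "GACCAT"]

if __name__ == "__main__":
    for query, (name, score) in zip(QUERIES, search_all(QUERIES, DATABASE)):
        print(query, "->", name, score)
```

Because the tasks share no state, the parallel result is identical to running `best_hit` on each query in a plain loop; only the wall-clock time changes with the number of workers.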

The study of the genome represents a way to obtain new insight and extract novel knowledge about living beings. In particular, stand-alone phylogenetic analyses have many important applications in biological and medical research. These range from predicting the development of emerging infectious diseases (Salzberg et al., 2007), through the study of the evolution of Papillomaviruses, which are associated with cervical cancer (Gottschling et al., 2007), to determining the common origin of Caribbean frogs (Heinicke et al., 2007).

Recent years have witnessed significant progress in the field of algorithms for stand-alone phylogeny reconstruction, which is an NP-complete optimization problem (Chor and Tuller, 2005), with the release of programs such as TNT (Goloboff, 1999), RAxML (Stamatakis, 2006), MrBayes (Ronquist and Huelsenbeck, 2003) or GARLI (Zwickl, 2006). Because of the continuous explosive accumulation and availability of molecular sequence data, coupled with advances in phylogeny reconstruction methods, it has now become feasible to reconstruct and fully analyze large phylogenetic trees comprising hundreds or even thousands of sequences (organisms). However, current meta-analysis methods for phylogenetic trees, such as programs that conduct co-phylogenetic tests, cannot handle such large datasets.

To alleviate this bottleneck in the meta-analysis pipeline, we recently parallelized and released the highly optimized co-phylogenetic analysis program AxParafit (Axelerated Parafit; Stamatakis et al., 2007), which implements an elaborate statistical test of congruence between host and parasite trees (Legendre et al., 2002). AxParafit is a typical stand-alone Linux/Unix command line program. It has been integrated into, and can be invoked via, CopyCat (Meier-Kolthoff et al., 2007), a user-friendly graphical interface for co-phylogenetic analyses. In this article, we present an enhanced version of this tool suite (henceforth denoted as CopyCat(AxParafit)) for co-phylogenetic analyses that is packaged into a client tool which makes use of a world-wide Grid environment and thereby allows for large-scale data analysis. In the current version, the underlying Grid middleware is gLite (Laure et al., 2006), which is coupled with an efficient submission and execution model called Run Time Sensitive (RTS) scheduling and execution (Stockinger et al., 2006).

The remainder of this article is organized as follows. First, we provide a brief introduction to the field of phylogenetic inference, co-phylogenetic analyses, and related software packages in Section 2. Next, we discuss the implementation and architecture of our new approach for efficient adaptation of the CopyCat(AxParafit) tool suite to a Grid environment. Finally, we provide detailed performance results on the EGEE (Enabling Grids for E-SciencE, http://www.eu-egee.org) Grid infrastructure (where the gLite middleware is deployed in production mode) and demonstrate the performance as well as the scalability of our proposed bioinformatics tool.

BACKGROUND

Phylogenetic (evolutionary) trees are used to represent the evolutionary history of a set of s
