of bioinformatics, biological insight is typically
generated via data analysis pipelines that use a
plethora of distinct and highly specialized tools.
Most commonly, bioinformaticians and biologists
collaborate to analyze data extracted from large
databases containing DNA and/or protein data in
order to study, e.g., the function of living beings,
the effect and influence of diseases and defects, or
their evolutionary history. Early “classic”
bioinformatics tools such as CLUSTALW (Thompson
et al., 1994) and BLAST (Altschul et al., 1997),
which have been ported to Grid computing
environments, deal with biological sequence
search, analysis, and comparison. Typically, these
programs are embarrassingly parallel and are
therefore ideal candidate applications for Grid
computing environments (Stockinger et al., 2006).
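To illustrate what "embarrassingly parallel" means for sequence
comparison, the following is a minimal Python sketch, not the way
CLUSTALW or BLAST are actually implemented or ported. It assumes a toy
similarity score (fraction of matching positions) and hypothetical
helper names (identity, best_hit, split, parallel_search). The point is
only that each query can be scored independently, so the workload can
be cut into chunks that never need to communicate.

```python
from concurrent.futures import ThreadPoolExecutor


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions,
    measured over the length of the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def best_hit(query, database):
    """Return (name, score) of the database entry most similar to query."""
    hits = ((name, identity(query, seq)) for name, seq in database.items())
    return max(hits, key=lambda hit: hit[1])


def split(items, n_chunks):
    """Split items into n_chunks nearly equal, order-preserving chunks."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def search_chunk(chunk, database):
    """Process one chunk; no data from any other chunk is needed."""
    return [best_hit(query, database) for query in chunk]


def parallel_search(queries, database, n_workers=4):
    """Score all queries by handing independent chunks to workers."""
    chunks = split(queries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda c: search_chunk(c, database), chunks)
    # pool.map preserves chunk order, so hits line up with the queries.
    return [hit for chunk_hits in results for hit in chunk_hits]
```

On a Grid, each chunk would correspond to a separate job on a remote
worker node rather than a local thread; the threads here only stand in
for that independence.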
The study of the genome is one way to obtain new
insight and extract novel knowledge about living
beings. In particular, stand-alone phylogenetic
analyses have many important applications in
biological and medical research. These range from
predicting the development of emerging infectious
diseases (Salzberg et al., 2007), through studying
the evolution of Papillomavirus, which is
associated with cervical cancer (Gottschling
et al., 2007), to determining the common
origin of Caribbean frogs (Heinicke et al., 2007).
Recent years have witnessed significant
progress in the field of stand-alone phylogeny
reconstruction, an NP-hard optimization problem
(Chor and Tuller, 2005), with the release of
programs such as TNT (Goloboff, 1999), RAxML
(Stamatakis, 2006), MrBayes (Ronquist and
Huelsenbeck, 2003), and GARLI (Zwickl, 2006).
Because of the continuous, explosive accumulation
of molecular sequence data, coupled with
advances in phylogeny reconstruction methods,
it has become feasible to reconstruct and
fully analyze large phylogenetic trees comprising
hundreds or even thousands of sequences
(organisms). However, current meta-analysis
methods for phylogenetic trees, such as programs
that conduct co-phylogenetic tests, cannot handle
such large datasets.
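To give a feeling for why tree reconstruction on hundreds or thousands
of sequences is computationally hard, the short Python sketch below
counts the distinct unrooted binary tree topologies for n taxa using
the standard combinatorial result (2n-5)!! for n >= 3. This formula is
not taken from the article, and the function name num_unrooted_trees
is hypothetical. It is meant only as an illustration of how quickly
the search space grows.

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted binary tree topologies for n_taxa
    labeled leaves: (2n-5)!! = 3 * 5 * ... * (2n-5), for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 50):
        print(n, num_unrooted_trees(n))
```

Already for 10 taxa there are 2,027,025 possible topologies, which is
why programs of this kind rely on heuristic search rather than
exhaustive enumeration.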
To alleviate this bottleneck in the meta-analysis
pipeline, we recently parallelized and released
the highly optimized co-phylogenetic analysis
program AxParafit (Axelerated Parafit; Stamatakis
et al., 2007), which implements an elaborate
statistical test of congruence between host and
parasite trees (Legendre et al., 2002). AxParafit
is a typical stand-alone Linux/Unix command-line
program. It has been integrated into, and can be
invoked via, CopyCat (Meier-Kolthoff et al., 2007),
a user-friendly graphical interface for
co-phylogenetic analyses. In this article, we
present an enhanced version of this tool suite
(henceforth denoted as CopyCat(AxParafit)) for
co-phylogenetic analyses. It is packaged into a
client tool that makes use of a world-wide Grid
environment and thereby allows for large-scale
data analysis. In the current version, the
underlying Grid middleware is gLite (Laure et al.,
2006), coupled with an efficient submission and
execution model called Run Time Sensitive (RTS)
scheduling and execution (Stockinger et al., 2006).
The remainder of this article is organized as
follows: first, we provide a brief introduction
to the field of phylogenetic inference,
co-phylogenetic analyses, and related software
packages in Section 2. Next, we discuss the
implementation and architecture of our approach
for efficiently adapting the CopyCat(AxParafit)
tool suite to a Grid environment. Finally, we
provide detailed performance results on the EGEE
(Enabling Grids for E-SciencE,
http://www.eu-egee.org) Grid infrastructure (where
the gLite middleware is deployed in production
mode) and demonstrate the performance and
scalability of our proposed bioinformatics tool.
BACKGROUND
Phylogenetic (evolutionary) trees are used to
represent the evolutionary history of a set of s