Large-Scale Co-Phylogenetic Analysis on the Grid - Cloud, Grid and High Performance Computing: Emerging Applications

Information Technology Reference

In-Depth Information

Initially, these files are parsed and transformed

into the appropriate file format by CopyCat. In

a first step, a principal coordinate analysis is

conducted on the respective tree-based distance

matrices induced by the host and parasite trees.

This analysis is carried out by the AxPcoords

(Axelerated Principal Coordinates) program (Sta-

matakis et al., 2007), which is an optimized version

of the analogous DistPCoA program (Legendre

and Anderson, 1998). The output of AxPcoords

for the host and parasite trees is then parsed and

appropriately prepared for the AxParafit analysis

which takes the two principal coordinates matrices

and the binary matrix with the associations as

input. The output of this computation is a list of

probabilities for the individual null hypotheses

that a certain association does not improve the fit

between host and associate phylogenies. In addi-

tion, a probability for the global null hypothesis

of the absence of congruence between host and

parasite trees is computed. Upon termination of

AxParafit the output files are read by the CopyCat

tool and presented in a human-readable format.

It is important to note that the computations with

AxParafit represent the by far largest part (over

95%) of the computational effort required to con-

duct such a co-phylogenetic analysis. Therefore,

the AxPcoords and CopyCat parts of the workflow

can be handled sequentially and executed locally.

We will, thus, mainly focus on the parallel and

gridified versions of AxParafit in the next sections.

The basic workflow is outlined in Figure 2 (at the

end of the article).

Initially, the program will compute the statistics

for the global congruence of the complete list of

host-parasite associations. This part of the com-

putation is significantly less expensive than the

individual tests for each host-parasite association,

which take nz times longer, where nz is the number

of non-zero entries in the binary association matrix,

i.e., number of entries in the original host-parasite

association list. For large datasets that require

parallel and distributed computing resources as

well as a sufficient amount of memory typically

nz >> 1 . The statistics computed during the global

test of congruence are required as input data for

the individual tests of host-parasite associations,

hence there is a sequential dependency: global test

→ nz local tests. Thus, in the MPI-based parallel

implementation we only parallelized these nz lo-

cal tests which can be computed independently of

each other via a straight-forward master-worker

scheme. The master simply distributes the nz

individual host-parasite association tests to the

worker processes.

The potential bottleneck induced by the se-

quential part of the computations can be allevi-

ated by using, e.g., the respective shared-memory

implementations of BLAS. With respect to a

gridification, this sequential dependency actually

has advantages. Since the inference time as well

as memory footprint of the global test of congru-

ence are nearly identical (same type of operation,

identical matrix sizes, permuted input data) to

each of the individual nz tests, the information

on run-times and memory requirements collected

during the global tests can be used for scheduling

decisions, as well as to determine an optimal level

of granularity and to assess respective resource

requirements.

Parallel AxParafit

The most compute-intensive operation (95%

of execution time) conducted by AxParafit to

compute the statistics is a dense matrix-matrix

multiplication of double precision floating point

numbers. This is the rationale for integration of

highly optimized BLAS routines. In the remainder

of this article, we thus always refer to the BLAS-

based version of AxParafit.

FIT FOR THE GRID

In the following section, we describe how

CopyCat(AxParafit) has been adapted and modi-

fied for use in a Grid environment. The overall

Cloud, Grid and High Performance Computing: Emerging Applications

Search WWH ::

Custom Search

Home