to read and preprocess the data. Therefore, the worker nodes do not need to spend
any time waiting for the input data (except in the first session).
The time taken to run the Granger analysis depends on various parameters such
as the spatial resolution of the fMRI data, the number of volumes, the regularization
parameter λ of the Lasso regression, and the number of processors. For fMRI data
with a spatial resolution of 32 × 64 × 64 voxels and 500 volumes, the 1,024-processor
Blue Gene/L system typically takes between 30 and 60 minutes to run the Granger
causality analysis on a single session. A typical fMRI experiment with 100 sessions
thus takes 1 to 2 rack-days of Blue Gene/L compute time.
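With 32 × 64 × 64 = 131,072 voxels distributed over 1,024 processors, each processor carries out roughly 128 regressions per session.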
The implementation outlined above demonstrates how high-performance computing
can be used to carry out Granger causality analysis on fMRI data. Though far from
optimal, it provides a practical way to quickly prototype the analysis technique and
obtain results in a reasonable time. The implementation distributes voxels to processors,
which carry out the regressions independently; as a result, the Lasso regression code
itself did not need to be parallelized. (We used an off-the-shelf implementation of the
Lasso based on LARS.)
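A minimal sketch of this master-worker distribution, written in C with MPI, is shown below. The voxel count, the message layout, and the run_lars_lasso() routine are placeholders introduced for illustration; they are not the code actually used on Blue Gene/L.

#include <mpi.h>

#define N_VOXELS (32 * 64 * 64)   /* voxels per session, as in the text */
#define TAG_WORK 1
#define TAG_STOP 2

/* Hypothetical stand-in for the off-the-shelf LARS-based Lasso solver;
 * the real code would return the regression coefficients for the voxel. */
static double run_lars_lasso(int voxel) { return (double)voxel; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand out one voxel at a time and collect the results. */
        int next = 0, active = 0, stop = -1;
        double result;
        MPI_Status st;
        for (int w = 1; w < size; ++w) {
            if (next < N_VOXELS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                ++next; ++active;
            } else {
                MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            /* ... write the result for the finished voxel to the output file ... */
            if (next < N_VOXELS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {
        /* Worker: regress whatever voxel the master sends, independently of
         * all other workers, until told to stop. */
        int voxel;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&voxel, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = run_lars_lasso(voxel);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

In this sketch a voxel is handed out only when a worker becomes free, so fast regressions do not leave processors idle while slow ones finish; the price is that all work and all results flow through the single master.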
There are several limitations of this implementation that would need to be
addressed if Granger causality analysis were to be used routinely. The entire code is
based on a MATLAB implementation of the LARS algorithm. MATLAB scripts
are interpreted and inherently much slower than equivalent code in C, so rewriting
this code in C is expected to give significant performance improvements.
The code scales well up to a thousand processors, but the gains from adding
more processors begin to taper off beyond that point. There are two main
reasons for this. First, the master node becomes a bottleneck as the number
of processors increases. The present implementation has a single master node,
which acts as the sole point of contact for distributing the work and writing the
results to files. As the number of worker nodes grows, the master must distribute
work to more processors and write out their results faster. This scales only up to
a certain point, after which the master slows down all the worker nodes. This
problem is not fundamental and can be overcome by using MPI-IO [19] for writing
the results and by having more than one master node when the number of processors
is large.
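As an illustration of the first remedy, the sketch below lets every worker write its own fixed-size result record through MPI-IO, so the master no longer has to serialize the output. The file name, the record layout, and the helper write_voxel_result() are hypothetical; only the use of MPI_File_write_at reflects the idea described above.

#include <mpi.h>

/* Each worker writes its own results directly to a shared file with MPI-IO,
 * instead of routing them through the master.  A fixed-size record per voxel
 * lets every rank compute its own file offset. */
void write_voxel_result(const char *path, int voxel,
                        const double *coeffs, int n_coeffs)
{
    MPI_File fh;
    /* MPI_File_open is collective over its communicator; MPI_COMM_SELF is
     * used here so each worker can open and write independently.  A real
     * code would open the file once per session, not once per voxel. */
    MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)voxel * n_coeffs * sizeof(double);
    MPI_File_write_at(fh, offset, coeffs, n_coeffs, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double coeffs[4] = { 0.0, 0.0, 0.0, 0.0 };   /* dummy coefficients */
    write_voxel_result("granger_results.bin", rank, coeffs, 4);
    MPI_Finalize();
    return 0;
}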
The second problem, which is more fundamental, arises because the regression
computations for different voxels take different amounts of time. In the extreme
case, when the number of worker nodes equals the number of voxels, every worker
node carries out exactly one regression. The time taken by the program is then
almost equal to the maximum time taken by any single regression, which can be
significantly more than the average. The only way to overcome this limit is to use
a parallel implementation of the LARS algorithm. The algorithm is based on matrix
multiplications, which can be parallelized; on fMRI-sized datasets, such a
parallelization is expected to yield significant performance improvements. However,
there will still be limits to such scaling.
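The kind of matrix product inside LARS that lends itself to this parallelization is sketched below: the correlation of every predictor with the current residual, c = X'r, with the rows (time points) of the design matrix block-distributed over processors. The function name and the data layout are assumptions made for illustration.

#include <mpi.h>
#include <string.h>

/* Correlation of every predictor with the current residual, c = X' * r,
 * one of the main matrix products in each LARS step.  X_local holds this
 * rank's n_local x p block of rows (row-major), r_local the matching
 * entries of the residual; on return, c holds the full length-p result
 * on every rank of comm. */
void lars_correlations(const double *X_local, const double *r_local,
                       int n_local, int p, double *c, MPI_Comm comm)
{
    memset(c, 0, (size_t)p * sizeof(double));
    for (int i = 0; i < n_local; ++i)            /* local partial products */
        for (int j = 0; j < p; ++j)
            c[j] += X_local[(size_t)i * p + j] * r_local[i];

    /* Combine the partial sums so every rank sees the same correlations. */
    MPI_Allreduce(MPI_IN_PLACE, c, p, MPI_DOUBLE, MPI_SUM, comm);
}

Other dense operations in LARS, such as updating the Gram matrix of the active set, can be distributed along the same lines, but the global reductions they require are one source of the scaling limits mentioned above.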
An optimal implementation should employ both strategies; that is, it should
partition the processors into groups and assign groups of voxels to the partitions.
Each processor group should then run a parallel version of the LARS algorithm
on the voxels assigned to it.
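A rough sketch of how the processors could be partitioned for such a two-level scheme, using MPI_Comm_split, is given below. The group size, the voxel partitioning, and the parallel_lars() call are assumptions for illustration.

#include <mpi.h>

#define N_VOXELS    (32 * 64 * 64)
#define GROUP_SIZE  32              /* assumed number of processors per group */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int group = rank / GROUP_SIZE;                        /* which group this rank joins */
    int n_groups = (size + GROUP_SIZE - 1) / GROUP_SIZE;

    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    /* Contiguous slice of voxels owned by this group. */
    int per_group = (N_VOXELS + n_groups - 1) / n_groups;
    int first = group * per_group;
    int last  = first + per_group < N_VOXELS ? first + per_group : N_VOXELS;

    for (int v = first; v < last; ++v) {
        /* parallel_lars(v, group_comm);   -- group-parallel LARS solver would go here */
    }

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}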