to read and preprocess the data. Therefore, the worker nodes do not need to spend
any time waiting for the input data (except in the first session).
The time taken to run the Granger analysis depends on various parameters such
as the spatial resolution of the fMRI data, the number of volumes, the regularization
parameter λ of the Lasso regression, and the number of processors. For fMRI data
with a spatial resolution of 32 × 64 × 64 voxels and 500 volumes, the 1,024-processor
Blue Gene/L system typically takes between 30 and 60 minutes to run the Granger
causality analysis on a single session. A typical fMRI experiment with 100 sessions
thus takes 1 to 2 rack-days of Blue Gene/L compute time.
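With 32 × 64 × 64 = 131,072 voxels distributed over 1,024 processors, each processor carries out roughly 128 regressions per session.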
The implementation outlined above demonstrates how high-performance computing
can be used to carry out Granger causality analysis on fMRI data. Though far from
optimal, it provides a practical way to quickly prototype the analysis technique and
obtain results in a reasonable time. The implementation distributes voxels to processors,
which carry out the regressions independently; as a result, the Lasso regression code
itself did not need to be parallelized. (We used an off-the-shelf implementation of the
Lasso based on LARS.)
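A minimal sketch of this master-worker distribution, written in C with MPI, is shown below. The voxel count, the message layout, and the run_lars_lasso() routine are placeholders introduced for illustration; they are not the code actually used on Blue Gene/L.

#include <mpi.h>

#define N_VOXELS (32 * 64 * 64)   /* voxels per session, as in the text */
#define TAG_WORK 1
#define TAG_STOP 2

/* Hypothetical stand-in for the off-the-shelf LARS-based Lasso solver;
 * the real code would return the regression coefficients for the voxel. */
static double run_lars_lasso(int voxel) { return (double)voxel; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand out one voxel at a time and collect the results. */
        int next = 0, active = 0, stop = -1;
        double result;
        MPI_Status st;
        for (int w = 1; w < size; ++w) {
            if (next < N_VOXELS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                ++next; ++active;
            } else {
                MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            /* ... write the result for the finished voxel to the output file ... */
            if (next < N_VOXELS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {
        /* Worker: regress whatever voxel the master sends, independently of
         * all other workers, until told to stop. */
        int voxel;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&voxel, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = run_lars_lasso(voxel);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

In this sketch a voxel is handed out only when a worker becomes free, so fast regressions do not leave processors idle while slow ones finish; the price is that all work and all results flow through the single master.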
There are several limitations of this implementation that would need to be
addressed if Granger causality analysis were to be used routinely. The entire code is
based on a MATLAB implementation of the LARS algorithm. MATLAB scripts
are interpreted and inherently much slower than equivalent code in C, so rewriting
this code in C is expected to give significant performance improvements.
The code scales well up to a thousand processors, but the gains from adding
more processors begin to taper off beyond that point. There are two main
reasons for this. First, the master node becomes a bottleneck as the number
of processors increases. The present implementation has a single master node,
which acts as the sole point of contact for distributing the work and writing the
results to files. As the number of worker nodes grows, the master must distribute
work to more processors and write out their results faster. This scales only up to
a certain point, after which the master slows down all the worker nodes. This
problem is not fundamental and can be overcome by using MPI-IO [19] for writing
the results and by having more than one master node when the number of processors
is large.
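As an illustration of the first remedy, the sketch below lets every worker write its own fixed-size result record through MPI-IO, so the master no longer has to serialize the output. The file name, the record layout, and the helper write_voxel_result() are hypothetical; only the use of MPI_File_write_at reflects the idea described above.

#include <mpi.h>

/* Each worker writes its own results directly to a shared file with MPI-IO,
 * instead of routing them through the master.  A fixed-size record per voxel
 * lets every rank compute its own file offset. */
void write_voxel_result(const char *path, int voxel,
                        const double *coeffs, int n_coeffs)
{
    MPI_File fh;
    /* MPI_File_open is collective over its communicator; MPI_COMM_SELF is
     * used here so each worker can open and write independently.  A real
     * code would open the file once per session, not once per voxel. */
    MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)voxel * n_coeffs * sizeof(double);
    MPI_File_write_at(fh, offset, coeffs, n_coeffs, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double coeffs[4] = { 0.0, 0.0, 0.0, 0.0 };   /* dummy coefficients */
    write_voxel_result("granger_results.bin", rank, coeffs, 4);
    MPI_Finalize();
    return 0;
}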
The second problem, which is more fundamental, arises because the regression
computations for different voxels take different amounts of time. In the extreme
case, when the number of worker nodes equals the number of voxels, every worker
node carries out exactly one regression. The time taken by the program is then
almost equal to the maximum time taken by any single regression, which can be
significantly more than the average. The only way to overcome this limit is to use
a parallel implementation of the LARS algorithm. The algorithm is based on matrix
multiplications, which can be parallelized; on fMRI-sized datasets, such a
parallelization is expected to yield significant performance improvements. However,
there will still be limits to such scaling.
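The kind of matrix product inside LARS that lends itself to this parallelization is sketched below: the correlation of every predictor with the current residual, c = X'r, with the rows (time points) of the design matrix block-distributed over processors. The function name and the data layout are assumptions made for illustration.

#include <mpi.h>
#include <string.h>

/* Correlation of every predictor with the current residual, c = X' * r,
 * one of the main matrix products in each LARS step.  X_local holds this
 * rank's n_local x p block of rows (row-major), r_local the matching
 * entries of the residual; on return, c holds the full length-p result
 * on every rank of comm. */
void lars_correlations(const double *X_local, const double *r_local,
                       int n_local, int p, double *c, MPI_Comm comm)
{
    memset(c, 0, (size_t)p * sizeof(double));
    for (int i = 0; i < n_local; ++i)            /* local partial products */
        for (int j = 0; j < p; ++j)
            c[j] += X_local[(size_t)i * p + j] * r_local[i];

    /* Combine the partial sums so every rank sees the same correlations. */
    MPI_Allreduce(MPI_IN_PLACE, c, p, MPI_DOUBLE, MPI_SUM, comm);
}

Other dense operations in LARS, such as updating the Gram matrix of the active set, can be distributed along the same lines, but the global reductions they require are one source of the scaling limits mentioned above.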
An optimal implementation should employ both strategies; that is, it should
partition the processors into groups and assign groups of voxels to the partitions.
Each processor group should then run a parallel version of the LARS algorithm
on the voxels assigned to it.
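A rough sketch of how the processors could be partitioned for such a two-level scheme, using MPI_Comm_split, is given below. The group size, the voxel partitioning, and the parallel_lars() call are assumptions for illustration.

#include <mpi.h>

#define N_VOXELS    (32 * 64 * 64)
#define GROUP_SIZE  32              /* assumed number of processors per group */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int group = rank / GROUP_SIZE;                        /* which group this rank joins */
    int n_groups = (size + GROUP_SIZE - 1) / GROUP_SIZE;

    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    /* Contiguous slice of voxels owned by this group. */
    int per_group = (N_VOXELS + n_groups - 1) / n_groups;
    int first = group * per_group;
    int last  = first + per_group < N_VOXELS ? first + per_group : N_VOXELS;

    for (int v = first; v < last; ++v) {
        /* parallel_lars(v, group_comm);   -- group-parallel LARS solver would go here */
    }

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}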