for best performance. However, if this effort is undertaken, results on up to 256
processors indicate that the adaptive approach performs at least as well as the better
of POS or POB.
Peterka et al. [15] propose a different approach to achieving load balancing. They
partition integration into multiple rounds, with each round advancing integral curves
only over a small time interval. After each round, the workload is analyzed and work
is redistributed. Load is measured primarily as the number of particles residing on
each processor.
To actually balance the load, geometric partitioning using recursive coordinate
bisection is employed; this ensures that particles assigned to a processor for the
next round are likely to reside in the same blocks, effecting an overall reduction
of I/O. Furthermore, since particle exchange is performed only between rounds,
communication can be optimized. Overall, reasonable scaling is observed up to
16,384 processors.
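The repartitioning step above can be sketched as follows. This is a schematic illustration of recursive coordinate bisection, not Peterka et al.'s actual code; the function name `rcb` and the representation of particles as coordinate tuples are assumptions made for the example.

```python
# Schematic sketch of recursive coordinate bisection (RCB) for
# round-based load balancing: particle positions are split along
# the longest coordinate axis at the median, recursively, so that
# each part receives a roughly equal number of particles.

def rcb(particles, num_parts):
    """Partition a list of particle positions into num_parts groups
    of approximately equal size via recursive coordinate bisection."""
    if num_parts == 1:
        return [particles]
    # Pick the axis with the largest spatial extent.
    dims = len(particles[0])
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in particles) -
                             min(p[d] for p in particles))
    # Split at the median along that axis.
    ordered = sorted(particles, key=lambda p: p[axis])
    mid = len(ordered) // 2
    half = num_parts // 2
    return (rcb(ordered[:mid], half) +
            rcb(ordered[mid:], num_parts - half))
```

Because the bisection is geometric, particles assigned to one part tend to lie in the same spatial blocks, which is what yields the I/O reduction described above.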
26.3.4 Hybrid Parallelism
To take advantage of modern parallel architectures where a single node typically
contains multiple processor cores with shared memory, Camp et al. [4] presented and
studied the performance of a hybrid parallel implementation of both POS and POB.
Their implementation is based on a combination of classical message passing across
nodes with multiple threads per node, and the paper focuses on comparing hybrid
algorithm variants to non-hybrid ones. In their hybrid implementation of POS and
POB, worker threads perform actual integration work, while I/O and communication
are managed by separate threads.
For POS, multiple separate I/O threads identify blocks to be loaded and initiate
I/O if there is room in the block cache; when a block has finished loading, the
corresponding integral curves are added to a work queue shared by the worker threads.
In their implementation, the number of worker and I/O threads is in principle
arbitrary; however, they propose using one worker thread and one I/O thread per core
to overlap computation with I/O.
Similarly, in the hybrid POB algorithm, several worker threads process streamlines
in the resident set of blocks, and a single communication thread is responsible for
receiving streamlines and sending them to other processors. Owing to limitations of
the MPI message-passing library, however, the communication thread must resort to
polling to identify sendable or newly received streamlines. Thus, one core
is exclusively dedicated to this task, while the remaining cores perform integration.
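The polling loop of the dedicated communication thread can be sketched as below. This is a simulation, not MPI code: the nonblocking MPI probe/send/receive calls are stood in for by thread-safe queues, and all names are illustrative.

```python
import queue
import threading
import time

# Sketch of the POB communication thread: it repeatedly polls for
# streamlines that workers want sent to other processors, and for
# newly arrived streamlines, since it cannot block on both at once.

outgoing = queue.Queue()   # streamlines workers have marked for sending
incoming = queue.Queue()   # stand-in for the MPI receive side
local_work = []            # streamlines handed to local workers
stop = threading.Event()

def comm_thread():
    while not stop.is_set():
        try:                           # poll for sendable streamlines
            s = outgoing.get_nowait()
            incoming.put(s)            # stand-in for a nonblocking send
        except queue.Empty:
            pass
        try:                           # poll for received streamlines
            local_work.append(incoming.get_nowait())
        except queue.Empty:
            time.sleep(0)              # yield the (dedicated) core
```

Because the loop never blocks, the thread consumes its core continuously, which is why one core is set aside for it while the rest integrate.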
Comparing the performance of hybrid and non-hybrid implementations for several
test cases on 128 cores in total (32 nodes with 4 cores each), they report significant
performance gains. Largely, this is a consequence of the increased memory available
to each process, which positively influences cache size and thus block reuse in the
POS case, and reduces starvation in the POB algorithm. However, they also describe