for best performance. However, if this effort is undertaken, results on up to 256
processors indicate that the adaptive approach performs at least as well as the better
of POS or POB.
Peterka et al. [15] propose a different approach to achieving load balancing. They
partition integration into multiple rounds, with each round advancing integral curves
only over a small time interval. After each round, the workload is analyzed and work
is redistributed. Load is measured primarily as the number of particles residing on
each processor.
To actually balance the load, geometric partitioning using recursive coordinate
bisection is employed; this ensures that particles assigned to a processor for the
next round are likely to reside in the same blocks, effecting an overall reduction
of I/O. Furthermore, since particle exchange is performed only between rounds,
communication can be optimized. Overall, reasonable scaling is observed up to
16,384 processors.
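The repartitioning step above can be sketched as follows. This is a schematic illustration of recursive coordinate bisection, not Peterka et al.'s actual code; the function name `rcb` and the representation of particles as coordinate tuples are assumptions made for the example.

```python
# Schematic sketch of recursive coordinate bisection (RCB) for
# round-based load balancing: particle positions are split along
# the longest coordinate axis at the median, recursively, so that
# each part receives a roughly equal number of particles.

def rcb(particles, num_parts):
    """Partition a list of particle positions into num_parts groups
    of approximately equal size via recursive coordinate bisection."""
    if num_parts == 1:
        return [particles]
    # Pick the axis with the largest spatial extent.
    dims = len(particles[0])
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in particles) -
                             min(p[d] for p in particles))
    # Split at the median along that axis.
    ordered = sorted(particles, key=lambda p: p[axis])
    mid = len(ordered) // 2
    half = num_parts // 2
    return (rcb(ordered[:mid], half) +
            rcb(ordered[mid:], num_parts - half))
```

Because the bisection is geometric, particles assigned to one part tend to lie in the same spatial blocks, which is what yields the I/O reduction described above.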
26.3.4 Hybrid Parallelism
To take advantage of modern parallel architectures where a single node typically
contains multiple processor cores with shared memory, Camp et al. [4] presented and
studied the performance of a hybrid parallel implementation of both POS and POB.
Their implementation is based on a combination of classical message passing across
nodes with multiple threads per node, and the paper focuses on comparing hybrid
algorithm variants to non-hybrid ones. In their hybrid implementation of POS and
POB, worker threads perform actual integration work, while I/O and communication
are managed by separate threads.
For POS, multiple separate I/O threads identify blocks to be loaded and initiate
I/O if there is room in the block cache; when a block has finished loading, the
corresponding integral curves are added to a work queue shared by the worker threads.
In their implementation, the number of worker and I/O threads is in principle
arbitrary; however, they propose using one worker thread and one I/O thread per core
to overlap computation with I/O.
Similarly, in the hybrid POB algorithm, several worker threads process streamlines
in the resident set of blocks, and a single communication thread is responsible for
receiving streamlines and sending them to other processors. Owing to limitations of
the MPI message-passing library, however, the communication thread must resort to
polling to identify sendable or newly received streamlines. Thus, one core
is exclusively dedicated to this task, while the remaining cores perform integration.
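The polling loop of the dedicated communication thread can be sketched as below. This is a simulation, not MPI code: the nonblocking MPI probe/send/receive calls are stood in for by thread-safe queues, and all names are illustrative.

```python
import queue
import threading
import time

# Sketch of the POB communication thread: it repeatedly polls for
# streamlines that workers want sent to other processors, and for
# newly arrived streamlines, since it cannot block on both at once.

outgoing = queue.Queue()   # streamlines workers have marked for sending
incoming = queue.Queue()   # stand-in for the MPI receive side
local_work = []            # streamlines handed to local workers
stop = threading.Event()

def comm_thread():
    while not stop.is_set():
        try:                           # poll for sendable streamlines
            s = outgoing.get_nowait()
            incoming.put(s)            # stand-in for a nonblocking send
        except queue.Empty:
            pass
        try:                           # poll for received streamlines
            local_work.append(incoming.get_nowait())
        except queue.Empty:
            time.sleep(0)              # yield the (dedicated) core
```

Because the loop never blocks, the thread consumes its core continuously, which is why one core is set aside for it while the rest integrate.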
Comparing the performance of hybrid and non-hybrid implementations for several
test cases on 128 cores in total (32 nodes with 4 cores each), they report significant
performance gains. Largely, this is a consequence of the increased memory available
to each process, which positively influences cache size and thus block reuse in the
POS case, and reduces starvation in the POB algorithm. However, they also describe