They find that the communication cost incurred by block transitions plays a secondary role compared to the overall number of blocks traversed. Essentially, in the case of a dense seed set, streamlines in the same spatial regions must be computed by the comparatively few processors owning the required blocks, leading to a large load imbalance. A similar observation holds for vector fields with regions of strong convergence, which tends to concentrate trajectories in small regions of a data set. To counteract this effect, they propose assigning blocks to processors in round-robin fashion or at random.
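To illustrate, a minimal sketch of such an assignment follows, assuming blocks are identified by consecutive integer indices; the function name and signature are hypothetical, not taken from the cited work.

import random

def assign_blocks(num_blocks, num_procs, strategy="round_robin", seed=0):
    """Map each block index to an owning processor rank.

    Scattering spatially adjacent blocks across ranks spreads dense
    regions over many owners instead of overloading a few.
    """
    if strategy == "round_robin":
        # Block i goes to rank i mod P, so neighboring blocks land on
        # different processors.
        return {b: b % num_procs for b in range(num_blocks)}
    if strategy == "random":
        rng = random.Random(seed)  # fixed seed for reproducibility
        return {b: rng.randrange(num_procs) for b in range(num_blocks)}
    raise ValueError("unknown strategy: " + strategy)

# Usage: 8 blocks distributed over 3 processors.
print(assign_blocks(8, 3))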
Nouanesengsy et al. [14] provided an approach to address the load imbalances in POB by using workload-aware partitioning of the vector field data. Treating block assignment as an optimization problem, they compute a flow graph from an abstract representation of the vector field. In conjunction with a given seed set, the flow graph provides an estimate of the blocks that must be traversed to compute the curves, and the resulting set of blocks is then partitioned to minimize communication cost. Additionally, to leverage resources optimally, block replication is allowed, which addresses starvation problems. In experiments, they observe better performance and scalability compared to a round-robin assignment; however, they incur a non-negligible preprocessing cost per data set.
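The following sketch illustrates the general idea under strong simplifying assumptions: block-to-block transition probabilities stand in for the flow graph, and a greedy load-balancing pass stands in for the optimization described in [14]. All names and parameters are hypothetical.

import heapq
from collections import defaultdict

def estimate_block_work(flow_graph, seed_counts, num_steps):
    """Propagate seed counts through a block-level flow graph.

    flow_graph[b] lists (neighbor, probability) pairs: the fraction of
    curves expected to leave block b for each neighbor. The result
    approximates how many curve segments each block must integrate.
    """
    work = defaultdict(float, seed_counts)
    frontier = dict(seed_counts)
    for _ in range(num_steps):
        nxt = defaultdict(float)
        for block, count in frontier.items():
            for nbr, prob in flow_graph.get(block, []):
                nxt[nbr] += count * prob
        for block, count in nxt.items():
            work[block] += count
        frontier = nxt
    return work

def partition_blocks(work, num_procs):
    """Greedily assign the heaviest blocks to the least-loaded rank."""
    heap = [(0.0, rank) for rank in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    for block in sorted(work, key=work.get, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[block] = rank
        heapq.heappush(heap, (load + work[block], rank))
    return assignment

# Usage: three blocks in a chain, all seeds placed in block 0.
graph = {0: [(1, 0.8)], 1: [(2, 0.8)]}
work = estimate_block_work(graph, {0: 100.0}, num_steps=3)
print(partition_blocks(work, num_procs=2))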
While it is straightforward to implement, the POB algorithm has one severe limitation: if the combined memory of all processors cannot accommodate the entire data set, the algorithm cannot be used. This is especially the case for time-varying vector fields.
26.3.3 Adaptive Load Balancing
As is apparent from the discussion above, POS and POB represent two extremes of a spectrum of parallelization schemes, focusing exclusively on either work distribution or data distribution. To address the respective shortcomings of either choice, Pugmire et al. [16] describe an adaptive algorithm that dynamically distributes both integral curves and blocks, attempting to balance I/O and communication loads at run-time based on heuristics.
Algorithmically, a master processor coordinates the workloads of a fixed set of slave processors. The master initially determines a seed point assignment for the slaves; as work progresses, it monitors the length of each slave's assigned integral curve set and the blocks that are loaded. To avoid overload and starvation, it then reassigns individual curves or entire blocks among slaves. Once the master determines that all curves have been computed, it instructs the slaves to terminate. This adaptive approach entails significant complexity, which is reflected in complicated heuristics. The authors also investigate the scalability of their approach, but find that using a single master becomes a bottleneck as the number of processors grows. They postulate that a hierarchy of master processes should perform better at scale. Furthermore, the load-balancing heuristics rely on several machine-dependent parameters, such as I/O and communication latency and speed.
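As a rough illustration of the coordination pattern, and not of the authors' actual heuristics, the following sketch rebalances pending curves from overloaded slaves to starved ones; the overload threshold and data layout are hypothetical.

def rebalance(queues, overload_factor=2.0):
    """Move pending curves from overloaded slaves to starved ones.

    queues maps a slave rank to its list of pending curve ids; the
    master would call this after gathering status reports. Returns
    the number of curves moved. The threshold is illustrative only.
    """
    avg = sum(len(q) for q in queues.values()) / len(queues)
    starved = [rank for rank, q in queues.items() if not q]
    moved = 0
    for idle in starved:
        # Pick the currently most loaded slave as the donor.
        busy = max(queues, key=lambda rank: len(queues[rank]))
        if len(queues[busy]) <= max(1, overload_factor * avg):
            break  # no slave is overloaded enough to donate work
        queues[idle].append(queues[busy].pop())
        moved += 1
    return moved

# Usage: four slaves; rank 0 is heavily loaded, ranks 1 and 3 starve.
queues = {0: list(range(12)), 1: [], 2: [100], 3: []}
print(rebalance(queues), {rank: len(q) for rank, q in queues.items()})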
 