partitions depend on the algorithm structure and the computer architecture. Fortunately, the loop only sets a given value of the physical variable VX of the ME, and consequently there are no loop-carried dependencies in LP between one iteration and the subsequent ones. The LP scheme is easy to implement on a distributed memory platform (MP); furthermore, since there are no loop dependencies, the communication cost is low. On the other hand, implementing LP on a shared memory platform (MC) is not straightforward, due to resource constraints (e.g. cache sizes at levels 1, 2 and 3, bus access policies and data flow restrictions) and the high number of variables involved in the computation. In any case, it is necessary to identify hotspots and bottlenecks using profilers such as Valgrind [7] or VTune [8].
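The independence of the iterations can be made concrete with a minimal sketch of the main loop. The task names follow the profiled functions generateME, solveME and computeConcurrence; their signatures and the surrounding types are invented here purely for illustration.

struct ME       { double vx; };
struct Solution { double value; };

static ME       generate_me(double vx)                 { return ME{vx}; }
static Solution solve_me(const ME& me)                 { return Solution{me.vx}; }
static double   compute_concurrence(const Solution& s) { return s.value; }

#include <vector>

struct Result { double vx; double concurrence; };

std::vector<Result> sweep_vx(double vll, double vull, double vinc) {
    std::vector<Result> results;
    // Each iteration depends only on its own VX value: nothing computed here is
    // read by a later iteration, so the loop has no loop-carried dependencies
    // and its iterations can be partitioned freely among processing units.
    for (double vx = vll; vx <= vull; vx += vinc) {
        ME me        = generate_me(vx);
        Solution sol = solve_me(me);
        results.push_back({vx, compute_concurrence(sol)});
    }
    return results;
}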
5 Algorithm Parallelization
The main idea of the parallel algorithm is to obtain maximum performance by exploiting all of the processing resources of the cluster architecture (computer nodes and their cores). As a first approach, however, it is natural for the algorithm to use the LP scheme on a distributed memory platform and the TP scheme on a shared memory platform. Nevertheless, in order to achieve higher speedups it is critical to use the most appropriate partitioning scheme(s).
5.1 LP Partitioning Scheme
Figure 3 suggests that the majority of the code lies inside the loop. This is indeed the case: Valgrind shows that, for the worst-case execution time scenario, 95.16%¹¹ of the processing time is spent inside this loop. This leaves only a small serial fraction, which is beneficial for parallelization and makes higher speedups attainable.
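As a point of reference (this bound is not part of the measurements above), Amdahl's law applied to a parallel fraction f = 0.9516 gives an upper bound on the achievable speedup of

    S_max = 1 / (1 - f) = 1 / 0.0484 ≈ 20.7,

so even with an unlimited number of processing units the overall speedup is limited to roughly 20x as long as the remaining 4.84% of the execution stays serial.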
The LP partitioning uses domain decomposition to split the range [ VLL , VULL ] of the physical variable VX into several sub-domains and then assigns them to the processing units (computer nodes or cores), as shown in Fig. 5. The number of chunks per processing unit is determined by VINC (the step increment) and the number of computers, np . These chunks are uniformly distributed among the processing units to keep the processing load balanced.
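A minimal sketch of this decomposition is shown below, assuming MPI as the distributed-memory layer. VLL, VULL, VINC and np follow the notation in the text, but their values and the exact chunk arithmetic are illustrative choices, not taken from QDsim.

#include <mpi.h>
#include <algorithm>
#include <cmath>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, np = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const double VLL  = 0.0;    // lower limit of the VX range (placeholder value)
    const double VULL = 1.0;    // upper limit of the VX range (placeholder value)
    const double VINC = 0.001;  // step increment (placeholder value)

    // Total number of VX steps in [VLL, VULL].
    const long total = static_cast<long>(std::floor((VULL - VLL) / VINC)) + 1;

    // Uniform block distribution: every rank gets total/np steps, and the first
    // total%np ranks get one extra step so the processing load stays balanced.
    const long base  = total / np;
    const long extra = total % np;
    const long mine  = base + (rank < extra ? 1 : 0);
    const long first = rank * base + std::min<long>(rank, extra);

    for (long i = first; i < first + mine; ++i) {
        const double vx = VLL + i * VINC;
        // generateME / solveME / computeConcurrence would run here for vx;
        // no communication is needed until the per-VX results are gathered.
        (void)vx;
    }

    std::printf("rank %d handles %ld of %ld VX values\n", rank, mine, total);
    MPI_Finalize();
    return 0;
}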
5.2 TP Partitioning Scheme
The measurement analysis performed with Valgrind showed that “Solve ME” is a hotspot (accounting for 91.3% of the processing time). Inside this function, the integration method consumes roughly 78% of the processing time, while the rest of the code (13.3%) consists of strictly serial statements. Therefore, the optimization efforts need to be focused there. This task solves a large coupled ODE system of the ME¹² using an integration method. To solve the ME, QDsim implements different integration methods, such as Adams-Bourdon for stiff systems, the Backward Differentiation Formula (BDF) and Euler.
¹¹ 95.16% (loop) = 0.02% (generateME) + 91.30% (solveME) + 3.84% (computeConcurrence).
¹² Generated by the task “Generate ME”.
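The TP scheme therefore concentrates on the work inside the integration method. The sketch below parallelizes one integration step with OpenMP; Euler is used only because it is the simplest of the methods named above, and representing the ME right-hand side as a dense matrix L acting on the state vector rho is an assumption made for illustration, not QDsim's actual data layout.

#include <cstddef>
#include <vector>

// One explicit Euler step: rho <- rho + dt * (L * rho).
void euler_step(const std::vector<std::vector<double>>& L,
                std::vector<double>& rho, double dt) {
    const std::size_t n = rho.size();
    std::vector<double> drho(n, 0.0);

    // The rows of L * rho are independent, so this inner work (the ~78% hotspot)
    // can be split across the cores of a node; the surrounding time loop stays
    // serial because each step depends on the previous one.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            acc += L[i][j] * rho[j];
        drho[i] = acc;
    }

    for (std::size_t i = 0; i < n; ++i)
        rho[i] += dt * drho[i];
}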