partitions depend on the algorithm structure and the computer architecture. Fortunately, the loop only sets a given value of the physical variable VX of the ME, and consequently there are no loop-carried dependencies in LP between one iteration and the subsequent ones. The LP scheme is easy to implement on a distributed memory platform (MP); furthermore, since there are no loop dependencies, the communication cost is low. On the other hand, implementing LP on a shared memory platform (MC) is not straightforward, due to resource constraints (e.g. cache sizes at levels 1, 2 and 3, bus access policies and data flow restrictions) and the high number of variables involved in the computation. In any case, it is necessary to identify hotspots and bottlenecks using profilers such as Valgrind [7] or VTune [8].
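The independence of the iterations can be made concrete with a minimal sketch of the main loop. The task names follow the profiled functions generateME, solveME and computeConcurrence; their signatures and the surrounding types are invented here purely for illustration.

struct ME       { double vx; };
struct Solution { double value; };

static ME       generate_me(double vx)                 { return ME{vx}; }
static Solution solve_me(const ME& me)                 { return Solution{me.vx}; }
static double   compute_concurrence(const Solution& s) { return s.value; }

#include <vector>

struct Result { double vx; double concurrence; };

std::vector<Result> sweep_vx(double vll, double vull, double vinc) {
    std::vector<Result> results;
    // Each iteration depends only on its own VX value: nothing computed here is
    // read by a later iteration, so the loop has no loop-carried dependencies
    // and its iterations can be partitioned freely among processing units.
    for (double vx = vll; vx <= vull; vx += vinc) {
        ME me        = generate_me(vx);
        Solution sol = solve_me(me);
        results.push_back({vx, compute_concurrence(sol)});
    }
    return results;
}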
5 Algorithm Parallelization
The main idea of the parallel algorithm is to obtain maximum performance by exploiting all of the processing resources of the cluster architecture (computer nodes and their cores). As a first approach, however, it is natural for the algorithm to use the LP scheme on a distributed memory platform and the TP scheme on a shared memory platform. Nevertheless, in order to achieve higher speedups it is critical to use the most appropriate partitioning scheme(s).
5.1 LP Partitioning Scheme
Figure 3 suggests that the majority of the code lies inside the loop. This is indeed the case: Valgrind shows that, for the worst-case execution time scenario, 95.16%¹¹ of the processing time is spent inside this loop. This leaves only a small serial fraction, which is beneficial for parallelization and makes higher speedups attainable.
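As a point of reference (this bound is not part of the measurements above), Amdahl's law applied to a parallel fraction f = 0.9516 gives an upper bound on the achievable speedup of

    S_max = 1 / (1 - f) = 1 / 0.0484 ≈ 20.7,

so even with an unlimited number of processing units the overall speedup is limited to roughly 20x as long as the remaining 4.84% of the execution stays serial.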
The LP partitioning uses domain decomposition to split the range [ VLL , VULL ] of the physical variable VX into several sub-domains and then assigns them to the processing units (computer nodes or cores), as shown in Fig. 5. The number of chunks per processing unit is determined by VINC (the step increment) and the number of computers, np . These chunks are uniformly distributed among the processing units to keep the processing load balanced.
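A minimal sketch of this decomposition is shown below, assuming MPI as the distributed-memory layer. VLL, VULL, VINC and np follow the notation in the text, but their values and the exact chunk arithmetic are illustrative choices, not taken from QDsim.

#include <mpi.h>
#include <algorithm>
#include <cmath>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, np = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const double VLL  = 0.0;    // lower limit of the VX range (placeholder value)
    const double VULL = 1.0;    // upper limit of the VX range (placeholder value)
    const double VINC = 0.001;  // step increment (placeholder value)

    // Total number of VX steps in [VLL, VULL].
    const long total = static_cast<long>(std::floor((VULL - VLL) / VINC)) + 1;

    // Uniform block distribution: every rank gets total/np steps, and the first
    // total%np ranks get one extra step so the processing load stays balanced.
    const long base  = total / np;
    const long extra = total % np;
    const long mine  = base + (rank < extra ? 1 : 0);
    const long first = rank * base + std::min<long>(rank, extra);

    for (long i = first; i < first + mine; ++i) {
        const double vx = VLL + i * VINC;
        // generateME / solveME / computeConcurrence would run here for vx;
        // no communication is needed until the per-VX results are gathered.
        (void)vx;
    }

    std::printf("rank %d handles %ld of %ld VX values\n", rank, mine, total);
    MPI_Finalize();
    return 0;
}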
5.2 TP Partitioning Scheme
The measurement analysis performed with Valgrind showed that “Solve ME” is a hotspot (accounting for 91.3% of the processing time). Inside this function, the integration method consumes roughly 78% of the processing time, while the rest of the code (13.3%) consists of strictly serial statements. Therefore, the optimization efforts need to be focused there. This task solves a large coupled ODE system of the ME¹² using an integration method. To solve the ME, QDsim implements different integration methods, such as Adams-Bourdon for stiff systems, the Backward Differentiation Formula (BDF) and Euler.
¹¹ 95.16% (loop) = 0.02% (generateME) + 91.30% (solveME) + 3.84% (computeConcurrence).
¹² Generated by the task “Generate ME”.
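The TP scheme therefore concentrates on the work inside the integration method. The sketch below parallelizes one integration step with OpenMP; Euler is used only because it is the simplest of the methods named above, and representing the ME right-hand side as a dense matrix L acting on the state vector rho is an assumption made for illustration, not QDsim's actual data layout.

#include <cstddef>
#include <vector>

// One explicit Euler step: rho <- rho + dt * (L * rho).
void euler_step(const std::vector<std::vector<double>>& L,
                std::vector<double>& rho, double dt) {
    const std::size_t n = rho.size();
    std::vector<double> drho(n, 0.0);

    // The rows of L * rho are independent, so this inner work (the ~78% hotspot)
    // can be split across the cores of a node; the surrounding time loop stays
    // serial because each step depends on the previous one.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            acc += L[i][j] * rho[j];
        drho[i] = acc;
    }

    for (std::size_t i = 0; i < n; ++i)
        rho[i] += dt * drho[i];
}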