MBSPDiscover: An Automatic Benchmark for MultiBSP Performance Analysis - High Performance Computing

Table 1. Computed values for g and L parameters for the studied architectures

machine  level  g (flops/word)  L (flops)
dell32   2      977.5           15550.2
dell32   1      334.9           7792.9
jolly    3      1315.9          16184.4
jolly    2      549.9           7157.9
jolly    1      105.3           498.2

Finally, using the least squares method, we estimate the values of g_i and L_i over the h-communications for each level. The final values for dell32 and jolly are reported in Table 1.
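The fitting step can be illustrated with a minimal sketch. It assumes that, at a given level i, the measured cost of an h-communication follows the linear model t(h) = g_i * h + L_i, so that g_i is the slope and L_i the intercept of an ordinary least squares line. The function name fit_g_and_L and the sample data are hypothetical; the samples are synthetic and are not measurements from dell32 or jolly.

```python
def fit_g_and_L(h_values, times):
    """Ordinary least squares fit of times ~ g * h + L; returns (g, L)."""
    n = len(h_values)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (h, time) pairs of equal length")
    mean_h = sum(h_values) / n
    mean_t = sum(times) / n
    # Sum of squared deviations of h, and of cross deviations of h and t.
    sxx = sum((h - mean_h) ** 2 for h in h_values)
    if sxx == 0:
        raise ValueError("h values must not all be identical")
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in zip(h_values, times))
    g = sxy / sxx            # slope: cost per communicated word
    L = mean_t - g * mean_h  # intercept: synchronization cost
    return g, L


# Synthetic, noise-free samples generated from g = 2.0 and L = 10.0
# (illustrative values only, not measured data).
h_samples = [1, 2, 4, 8, 16]
t_samples = [2.0 * h + 10.0 for h in h_samples]
g_est, L_est = fit_g_and_L(h_samples, t_samples)
```

With real measurements, one such fit would be run per level, using the timings collected for that level's h-communications.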

4.3 Validation of Results

To validate the results computed in the previous subsection, we conducted an experiment using a real application: the vector inner product from BSPedupack (actually the computation of the norm of a vector), described in Algorithm 1.3 in the MultiBSP programming model. As future work, we plan to extend the validation by considering a set of benchmark applications.

innerProduct(level, vector) {
    if (level.next == NULL) {
        // Leaf of the tree: compute the partial result sequentially.
        return sequentialInnerProduct(vector);
    } else {
        begin_parallel_multibsp(level.sons.length)
            // Each thread takes its own slice and descends one level.
            ownslice = split_vector(vector, multibsp_pid);
            level = level.sons[multibsp_pid];
            sync();
            results = innerProduct(level, ownslice);
            sync();
            // The master thread reduces the partial results.
            if (multibsp_pid == master) {
                return sequentialInnerProduct(results);
            }
        end_parallel_multibsp
    }
}

MBSPTree = MBSPDiscover();
innerProduct(MBSPTree, data_vector);

Algorithm 1.3. Vector Inner Product.

Algorithm 1.3 applies the MultiBSP programming model recursively, traversing the MBSPTree obtained with MBSPDiscover in the proposed benchmark. Using the tree structure, the data vector is split into slices, one for each thread at level i. For i > 0, the data splitting is applied recursively. At level 0, a sequential inner product algorithm is used to compute a partial result. After all threads at a level have synchronized, the master thread applies a reduction phase, combining the partial results using the sequential inner product, and returns the result to the upper level. At the root of the tree, the result is the inner product for the whole data vector.
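The recursion described above can be sketched in plain Python. This is a sequential simulation under stated assumptions, not the MultiBSP implementation: the Level class, the block distribution in split_vector, and the use of a plain sum for the reduction are all hypothetical choices made for illustration, and the sons of a node are visited in turn where the algorithm would start one thread per son.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Level:
    """Hypothetical tree node; a node without sons stands for one core."""
    sons: List["Level"] = field(default_factory=list)


def sequential_inner_product(values):
    # Inner product of a vector with itself (the squared norm).
    return sum(x * x for x in values)


def split_vector(vector, pid, parts):
    # Contiguous block distribution; trailing slices may be short or empty.
    block = -(-len(vector) // parts)  # ceiling division
    return vector[pid * block:(pid + 1) * block]


def inner_product(level, vector):
    if not level.sons:
        # Leaf: compute the partial result sequentially.
        return sequential_inner_product(vector)
    # Each son would run in its own thread; here they are visited in turn.
    partials = [
        inner_product(son, split_vector(vector, pid, len(level.sons)))
        for pid, son in enumerate(level.sons)
    ]
    # Reduction step (assumed): the partial results are added together.
    return sum(partials)


# Hypothetical machine: 2 components, each with 3 cores.
tree = Level(sons=[Level(sons=[Level() for _ in range(3)]) for _ in range(2)])
data = list(range(10))
result = inner_product(tree, data)
```

Because the slices partition the vector, the recursive result equals the sequential inner product of the whole vector, whatever the shape of the tree.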

The validation involves the following steps (applied for different vector sizes):

1. Estimate the number of communications and synchronizations at each level using hardware counters.
