MBSPDiscover: An Automatic Benchmark for MultiBSP Performance Analysis - High Performance Computing


(#3), eight components from level 2 are grouped. They share the RAM memory,

with a size of 128GB, as specified by $\text{tuple}_3 = \langle p_3 = 8,\ m_3 = 128\,\text{GB},\ g_3,\ L_3 \rangle$.

Finally, using the same procedure we previously applied to the dell32 architecture (i.e. joining all tuples and discarding level 0), we get the MultiBSP specification in Eq. 2.

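$$M_2 = [\langle p_1 = 2,\ m_1 = 2\,\text{MB},\ g_1,\ L_1 \rangle,\ \langle p_2 = 4,\ m_2 = 6\,\text{MB},\ g_2,\ L_2 \rangle,\ \langle p_3 = 8,\ m_3 = 128\,\text{GB},\ g_3,\ L_3 \rangle] \qquad (2)$$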
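As a minimal sketch, the specification in Eq. 2 can be written down as a list of per-level records. The `Level` class, the `total_leaf_components` helper, and the byte constants are hypothetical names introduced here for illustration only; `g` and `L` are left empty because they come from the benchmark.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional

MB = 2**20
GB = 2**30


@dataclass(frozen=True)
class Level:
    """One MultiBSP tuple <p, m, g, L> (hypothetical helper type)."""
    p: int                     # number of lower-level components grouped here
    m: int                     # memory shared at this level, in bytes
    g: Optional[float] = None  # communication parameter, set after benchmarking
    L: Optional[float] = None  # synchronization parameter, set after benchmarking


# Eq. 2: three levels, with level 0 already discarded.
M2 = [
    Level(p=2, m=2 * MB),
    Level(p=4, m=6 * MB),
    Level(p=8, m=128 * GB),
]


def total_leaf_components(spec):
    """Number of lowest-level components: the product of all p_i."""
    return prod(level.p for level in spec)


print(total_leaf_components(M2))  # 2 * 4 * 8 = 64
```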

Using these instances of the MultiBSP model, we can predict the running time of a MultiBSP algorithm executed on each machine. The $g_i$ and $L_i$ parameters in each tuple must first be calculated using the benchmarking procedure explained in the previous section. The next section reports the values of $g$ and $L$ obtained for both architectures at each level.
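To illustrate how such a prediction might look, the sketch below assumes the usual BSP-style cost of a single superstep at one level, namely local work plus $g \cdot h$ plus $L$. Both the function name `superstep_cost` and the numeric values are hypothetical placeholders, not measurements from either machine.

```python
def superstep_cost(w, h, g, L):
    """Predicted cost of one superstep at a given level.

    Assumes the BSP-style formula w + g*h + L, where w is the local work,
    h the size of the h-communication, and g, L the level's parameters.
    """
    return w + g * h + L


# Placeholder parameters for illustration only (NOT benchmark results).
g_example, L_example = 2.0, 100.0
cost = superstep_cost(w=1000.0, h=64, g=g_example, L=L_example)
print(cost)  # 1000 + 2*64 + 100 = 1228.0
```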

4.2 Results

We report the time to perform $h$-communications at each level, increasing $h$ as in the coreBenchmark function. Reporting the flops for each $h$-communication is important because we use least squares to estimate the $g_i$ and $L_i$ parameters at each level.
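The least-squares step can be sketched as an ordinary linear fit, assuming the measured cost at one level follows $T(h) \approx g \cdot h + L$. The function `fit_g_and_L` is a hypothetical helper, and the data below is synthetic (generated from a known line), not output from the benchmark.

```python
def fit_g_and_L(hs, times):
    """Fit times ~ g*h + L by ordinary least squares; return (g, L)."""
    n = len(hs)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (h, time) pairs of equal length")
    mean_h = sum(hs) / n
    mean_t = sum(times) / n
    sxx = sum((h - mean_h) ** 2 for h in hs)
    if sxx == 0:
        raise ValueError("all h values are identical; slope is undefined")
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in zip(hs, times))
    g = sxy / sxx              # slope: cost per unit of h
    L = mean_t - g * mean_h    # intercept: cost at h = 0
    return g, L


# Synthetic measurements lying exactly on T(h) = 3*h + 50.
hs = [0, 32, 64, 128, 256]
times = [3.0 * h + 50.0 for h in hs]
g, L = fit_g_and_L(hs, times)
print(g, L)
```

With real measurements the points will not lie exactly on a line, and the fit returns the slope and intercept that minimize the squared error.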

(a) Instance #1: dell32

(b) Instance #2: jolly

Fig. 7. Time to perform $h$-communications per level in a MultiBSP tree, with $h$ between 0 and 256

Figure 7 shows the $h$-communications at each level for dell32 (levels 1 and 2) and jolly (levels 1, 2, and 3). In level 1 of dell32, the communications stay within the shared memory (L3 cache), so they are twice as fast as in level 2, which uses the RAM. For jolly, the communications in level 1 stay within the L2 cache, so they are three times faster than in level 2, where communications go through the L3 cache. In turn, level 2 communications are 1.5× faster than those in level 3 of the hierarchy, which access the RAM.
