(#3), eight components from level 2 are grouped. They share the RAM memory, with a size of 128 GB, as specified by $tuple_3 = \langle p_3 = 8,\ m_3 = 128\,\text{GB},\ g_3,\ L_3 \rangle$.
Finally, using the same procedure we previously applied to the dell32 architecture (i.e., joining all tuples and discarding level 0), we get the MultiBSP specification in Eq. 2.
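$$M_2 = [\,\langle p_1 = 2,\ m_1 = 2\,\text{MB},\ g_1,\ L_1 \rangle,\ \langle p_2 = 4,\ m_2 = 6\,\text{MB},\ g_2,\ L_2 \rangle,\ \langle p_3 = 8,\ m_3 = 128\,\text{GB},\ g_3,\ L_3 \rangle\,] \tag{2}$$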
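As an illustration, the specification in Eq. 2 can be written down as a small data structure. The sketch below is a minimal Python rendering, not the authors' implementation: the `Level` class and the `leaf_count` helper are hypothetical names, memory sizes are converted to bytes assuming binary units, and $g_i$ and $L_i$ are left as `None` because they are only known after benchmarking.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional

MB = 1024 ** 2  # assumption: binary units for the sizes quoted in the text
GB = 1024 ** 3


@dataclass(frozen=True)
class Level:
    """One MultiBSP tuple (p_i, m_i, g_i, L_i); hypothetical container."""
    p: int                     # components of level i-1 grouped at level i
    m: int                     # memory shared at level i, in bytes
    g: Optional[float] = None  # unknown until benchmarked
    L: Optional[float] = None  # unknown until benchmarked


# Eq. 2: levels 1 to 3, with level 0 discarded.
M2: List[Level] = [
    Level(p=2, m=2 * MB),
    Level(p=4, m=6 * MB),
    Level(p=8, m=128 * GB),
]


def leaf_count(spec: List[Level]) -> int:
    """Number of lowest-level components under the root: product of all p_i."""
    return prod(level.p for level in spec)


print(leaf_count(M2))  # 2 * 4 * 8 = 64
```

Keeping one record per level makes it straightforward to fill in `g` and `L` later, once the benchmark has produced them.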
Using these instances of the MultiBSP model, we can predict the running time of a MultiBSP algorithm executed on each machine. The $g_i$ and $L_i$ parameters in each tuple must first be calculated using the benchmarking procedure explained in the previous section. The next section reports the values of $g$ and $L$ obtained for both architectures at each level.
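To make the idea of prediction concrete, the following is a minimal sketch that assumes the classic BSP superstep cost, $w + h \cdot g + L$, applied with the $g_i$ and $L_i$ of the level at which each superstep runs. The exact cost expression used by the authors is not shown in this section, so the function names and the numeric values below are illustrative placeholders, not measured parameters.

```python
from typing import Sequence, Tuple

# (w, h, g, L): local work, words communicated, and the g_i / L_i of the
# level where the superstep runs. All four share one time unit.
Superstep = Tuple[float, float, float, float]


def superstep_cost(w: float, h: float, g: float, L: float) -> float:
    """Classic BSP cost of one superstep: work + communication + sync."""
    return w + h * g + L


def predict_time(supersteps: Sequence[Superstep]) -> float:
    """Predicted running time as the sum of all superstep costs."""
    return sum(superstep_cost(w, h, g, L) for w, h, g, L in supersteps)


# Placeholder values only; they are NOT the benchmarked g and L.
example = [
    (1000.0, 64.0, 2.0, 50.0),  # a superstep at level 1
    (0.0, 16.0, 4.0, 200.0),    # a superstep at level 2
]
print(predict_time(example))  # 1178.0 + 264.0 = 1442.0
```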
4.2 Results
We report the time to perform $h$-communications at each level, increasing the number $h$ as in the coreBenchmark function. Reporting the flops for each $h$-communication is important because we compute $g_i$ and $L_i$ using least squares to estimate the parameters at each level.
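The least-squares step can be sketched as follows, assuming the cost of an $h$-communication at one level is modeled as the line $t = g \cdot h + L$, so that the slope estimates $g_i$ and the intercept estimates $L_i$. The function name `fit_g_L` is hypothetical, and the sample points are synthetic values generated from a known line, not benchmark output.

```python
from typing import Sequence, Tuple


def fit_g_L(hs: Sequence[float], ts: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for t = g*h + L; returns (g, L)."""
    n = len(hs)
    if n != len(ts) or n < 2:
        raise ValueError("need at least two (h, t) pairs of equal length")
    mean_h = sum(hs) / n
    mean_t = sum(ts) / n
    # Sum of squared deviations of h, and co-deviations of h and t.
    sxx = sum((h - mean_h) ** 2 for h in hs)
    if sxx == 0:
        raise ValueError("all h values are identical; slope is undefined")
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in zip(hs, ts))
    g = sxy / sxx            # slope: cost per communicated word
    L = mean_t - g * mean_h  # intercept: fixed cost per superstep
    return g, L


# Synthetic check: points generated from t = 3*h + 10 (not measured data).
hs = [0, 64, 128, 192, 256]
ts = [3 * h + 10 for h in hs]
g, L = fit_g_L(hs, ts)
print(g, L)
```

Running the fit once per level, on that level's measurements, would yield one $(g_i, L_i)$ pair per tuple of the specification.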
Fig. 7. Time to perform $h$-communications per level in a MultiBSP tree, with $h$ between 0 and 256. (a) Instance #1: dell32. (b) Instance #2: jolly.
Figure 7 shows the $h_i$ communications at each level for dell32 (levels 1 and 2) and jolly (levels 1, 2, and 3). In level 1 of dell32, the communications are within the shared memory (L3 cache), so they are twice as fast as in level 2, which uses the RAM memory. For jolly, the communications in level 1 are within the L2 cache, thus they are three times faster than in level 2, where communications are performed through the L3 cache. In turn, the level 2 communications are 1.5× faster than those in level 3 of the hierarchy, which are performed by accessing the RAM memory.