Global Positioning System Reference
In-Depth Information
GeoNames, SF1
Output Size
MRSimJoin
MRThetaJoin
200
120000000
100000000
150
80000000
100
60000000
40000000
50
20000000
0
0
1
2
3
4
5
Epsilon (miles)
Fig. 12. Increasing Epsilon-GeoNames.
Increasing Number of Nodes and Scale Factor . One of the goals of
MapReduce-based operations is to scale effi ciently when the number of
nodes and the data size increase proportionally. Ideally, the execution time
should remain constant. Figure 13 shows the execution time of MRSimJoin
and MRThetaJoin when the data size and the number of nodes increase from
(SF1, 2 nodes) to (SF5, 10 nodes). MRSimJoin follows the ideal execution
time much more closely than MRThetaJoin. MRSimJoin's execution time for
(SF5, 10) is only 2.1 times the one for (SF1, 2) while MRThetaJoin's execution
time for (SF5, 10) is 4.5 times the one for (SF1, 2). Moreover, the execution
time of MRThetaJoin grows from being 1.8 times the one of MRSimJoin for
(SF1, 2) to 3.8 times for (SF5, 10).
Increasing Number of Pivots . The execution time and number of rounds for
MRSimJoin, as the number of pivots increases, is presented in Fig. 14. The
fi gure shows that a smaller number of pivots generates a higher number of
rounds. We also observe that, in general, the execution time decreases when
the number of rounds decreases. We found that in most of the experiments
presented in this section, the best execution time is achieved using a single
round. To compute the number of pivots ( P ) that will generate a single round
for relatively smaller values of epsilon, i.e., smaller than 25%, we can use the
fact that the space needed for the in-memory QuickJoin algorithm is about
twice the size of the input data (Jacox and Samet 2008). Thus, to ensure that
the average MRSimJoin base partition (and window-pair partition) can be
solved using the in-memory QuickJoin, we need: P = 2 × k × D/T , where
D is the total input size (282 MB), T is the available memory for QuickJoin
(32 MB), and k is a factor that compensates the effect of data skewness on
the size of partitions (we used k = 4 since the real dataset is considerably
skewed). Using this expression, the value of P for this experiment is 70. This
value of P generates a single round and an execution time that is only 6%
higher than the best execution time (obtained with P = 105).
 
Search WWH ::




Custom Search