Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies

Global Positioning System Reference

In-Depth Information

GeoNames, SF1

Output Size

MRSimJoin

MRThetaJoin

200

120000000

100000000

150

80000000

100

60000000

40000000

50

20000000

0

1

2

3

4

5

Epsilon (miles)

Fig. 12. Increasing Epsilon-GeoNames.

Increasing Number of Nodes and Scale Factor . One of the goals of

MapReduce-based operations is to scale effi ciently when the number of

nodes and the data size increase proportionally. Ideally, the execution time

should remain constant. Figure 13 shows the execution time of MRSimJoin

and MRThetaJoin when the data size and the number of nodes increase from

(SF1, 2 nodes) to (SF5, 10 nodes). MRSimJoin follows the ideal execution

time much more closely than MRThetaJoin. MRSimJoin's execution time for

(SF5, 10) is only 2.1 times the one for (SF1, 2) while MRThetaJoin's execution

time for (SF5, 10) is 4.5 times the one for (SF1, 2). Moreover, the execution

time of MRThetaJoin grows from being 1.8 times the one of MRSimJoin for

(SF1, 2) to 3.8 times for (SF5, 10).

Increasing Number of Pivots . The execution time and number of rounds for

MRSimJoin, as the number of pivots increases, is presented in Fig. 14. The

fi gure shows that a smaller number of pivots generates a higher number of

rounds. We also observe that, in general, the execution time decreases when

the number of rounds decreases. We found that in most of the experiments

presented in this section, the best execution time is achieved using a single

round. To compute the number of pivots ( P ) that will generate a single round

for relatively smaller values of epsilon, i.e., smaller than 25%, we can use the

fact that the space needed for the in-memory QuickJoin algorithm is about

twice the size of the input data (Jacox and Samet 2008). Thus, to ensure that

the average MRSimJoin base partition (and window-pair partition) can be

solved using the in-memory QuickJoin, we need: P = 2 × k × D/T , where

D is the total input size (282 MB), T is the available memory for QuickJoin

(32 MB), and k is a factor that compensates the effect of data skewness on

the size of partitions (we used k = 4 since the real dataset is considerably

skewed). Using this expression, the value of P for this experiment is 70. This

value of P generates a single round and an execution time that is only 6%

higher than the best execution time (obtained with P = 105).

Geographical Information Systems: Trends and Technologies

Search WWH ::

Custom Search

Home