case, we use the MAP-2 method to filter candidate pairs against the k parameter
before their similarity scores are computed. In this case, the mappers of the MAP-2
method with pre-pruning emit only the top-k intermediate key-value pairs. For an
approximate k-NN query, however, each mapper can emit a set of top pairs whose size
depends on the total number of running mappers, as given by equation (5) below:
|toppairs| = max(⌈k/m⌉, 1)    (5)

where m denotes the total number of running mappers.
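The emission rule can be sketched as follows; this is a minimal illustration, not the authors' code, and it assumes the per-mapper budget of equation (5) is max(⌈k/m⌉, 1) for k requested neighbors and m running mappers:

```python
import heapq
from math import ceil


def topk_budget(k, num_mappers):
    # Hypothetical reading of equation (5): each of the m mappers keeps
    # at most ceil(k/m) candidate pairs, and always at least one.
    return max(ceil(k / num_mappers), 1)


def map_with_prepruning(candidate_pairs, k, num_mappers):
    """Emit only the locally best pairs instead of every candidate.

    candidate_pairs: iterable of (pair_id, similarity_score) tuples.
    Returns the top `budget` pairs by score, best first.
    """
    budget = topk_budget(k, num_mappers)
    heap = []  # min-heap of the best pairs seen so far
    for pair_id, score in candidate_pairs:
        if len(heap) < budget:
            heapq.heappush(heap, (score, pair_id))
        elif score > heap[0][0]:
            # A better pair arrived: drop the current worst kept pair.
            heapq.heapreplace(heap, (score, pair_id))
    return sorted(heap, reverse=True)
```

Because every mapper emits at most this bounded number of pairs, the shuffle volume scales with k and m rather than with the number of candidate pairs.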
6 Experiments

6.1 Environment Settings
The experiments are run on a cluster of commodity machines called Alex, which
has 48 nodes, each with 8 CPU cores and either 96 or 48 GB of RAM [2]. The
stable Hadoop version 1.2.1 is used, and the DBLP dataset [7] is used for similarity
search on publication titles. In general, we leave the other Hadoop configuration
parameters at their defaults: we want to preserve the most common settings that
commodity machines initially come with, even though some parameters could be tuned to
fit the Alex cluster. The configured capacity is 5 GB per node, giving 240 GB in
total for the 48-node cluster. The number of reducers is set to 168, the
replication factor to 47, the heap size of the cluster to about 629 MB, and the
HDFS block size to 64 MB. It is worth noting that Alex also carries the overhead
of other parallel tasks running concurrently, i.e., the nodes are not used exclusively
for our experiments. In addition, they are diskless nodes; the data actually reside
on a storage area network in the background. Each benchmark is therefore run 10 times
and the average values are reported. Last but not least, each benchmark starts fresh:
data from the previous benchmark are removed before the new benchmark
starts. All experiments for one type of query are run consecutively so that their
environments are as close as possible.
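For Hadoop 1.x, the non-default settings described above would correspond to configuration entries like the following; this is only a sketch using the classic Hadoop 1.x property names, with the values taken from the text:

```xml
<!-- hdfs-site.xml (sketch; values as described above) -->
<property><name>dfs.replication</name><value>47</value></property>
<property><name>dfs.block.size</name><value>67108864</value></property><!-- 64 MB -->

<!-- mapred-site.xml -->
<property><name>mapred.reduce.tasks</name><value>168</value></property>
```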
6.2 Empirical Evaluation
Fig. 6a shows the query processing time of the pairwise similarity case for the
naïve approach, which uses MapReduce to compute all possible pairs, and for the
proposed approach. The difference between the two approaches is small while the
benchmark size stays below a certain threshold, because both are dominated by the
fixed operating cost of the whole system. As the dataset size grows significantly,
however, the query processing time of the proposed method increases far more slowly
than that of the naïve method. In other words, when the dataset size is large enough,
there is a very big gap between them: the proposed method outperforms the naïve
method by 5% to 39% as the dataset size grows from 50 MB to 500 MB. Correspondingly,
the percentage of data volume saved during the computation phases ranges from 67%
to 82%, as shown in Fig. 6b.
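The reported 5% to 39% figures are relative savings in query processing time; with hypothetical timings, they would be computed as:

```python
def improvement_pct(t_naive, t_proposed):
    # Relative saving of the proposed method over the naive one, in percent.
    return 100.0 * (t_naive - t_proposed) / t_naive


# Example with made-up timings: a run where the proposed method takes
# 61 s against the naive method's 100 s corresponds to a 39% improvement.
print(improvement_pct(100.0, 61.0))
```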