Fig. 6. Pairwise similarity case, naïve approach vs. the proposed approach; (a) query processing time; and (b) saved data volume
For the pivot document case shown in Fig. 7, the dataset size grows from 900 MB to 4500 MB, nearly reaching the maximum capacity of each node. In Fig. 7a, there is little difference in performance among the all-pairs search, the range query with its threshold set to 90% similarity, and the 100-NN query, all of which are based on the proposed methods. In addition, Fig. 7b shows the percentage of saved data volume, which increases further as more query parameters are given.
Fig. 7. Pivot document, range query, and k-NN query cases; (a) query processing time; and (b) saved data volume
Moreover, two lessons emerge from the experiments: MapReduce operations should not be too complex, owing to limited memory, and reducing the candidate size is essential because it decreases the volume of data written to the HDFS file system during MapReduce operations, which in turn yields high performance.
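The candidate-reduction principle described here is commonly realized with prefix filtering in set-similarity search. The sketch below is a minimal, single-machine illustration in Python, not the paper's actual MapReduce filters; all names and the choice of Jaccard similarity are our assumptions. It prunes document pairs that cannot reach a similarity threshold before any full comparison is performed:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_length(size, t):
    # For Jaccard threshold t, two sets can only be similar if they
    # share a token within the first size - ceil(t * size) + 1 tokens
    # of a fixed global token order (the "prefix filter").
    return size - math.ceil(t * size) + 1

def candidate_pairs(docs, t):
    """Generate candidate pairs via prefix filtering.

    docs: dict mapping doc id -> token list; t: Jaccard threshold.
    """
    sorted_docs = {d: sorted(set(toks)) for d, toks in docs.items()}
    index = {}   # token -> doc ids whose prefix contains that token
    cands = set()
    for d, toks in sorted_docs.items():
        for tok in toks[:prefix_length(len(toks), t)]:
            for other in index.get(tok, []):
                cands.add(tuple(sorted((d, other))))
            index.setdefault(tok, []).append(d)
    return cands

def similar_pairs(docs, t):
    """Verify only the surviving candidates, never all pairs."""
    return {(a, b) for a, b in candidate_pairs(docs, t)
            if jaccard(docs[a], docs[b]) >= t}
```

For example, with three documents {d1: [a, b, c], d2: [a, b, d], d3: [x, y, z]} and t = 0.5, only the pair (d1, d2) survives the prefix filter, so a single verification replaces the three comparisons of a naïve all-pairs scan. It is exactly this shrinking of intermediate results that, in a distributed setting, cuts the volume of data written between MapReduce stages.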
7 Conclusion and Future Work
In this paper, we propose an elastic approximate similarity search with MapReduce that primarily addresses scalability. We show how our search scheme is specifically tailored to the four most popular similarity search scenarios: pairwise document similarity, pivot document search, range query, and k-NN query. In addition, our strategic filtering methods, which promote the scalability of MapReduce, reduce the size of the candidate set and eliminate unnecessary computation as well as space overhead. Moreover, we conduct experiments on large real-world datasets with the Hadoop framework to verify these methods.
For future work, we will model worksets as distinct n-grams instead of terms and extend our methods to other metrics for greater efficiency. We also plan to generalize our approach to the incremental case, where data arrive on the fly. Furthermore, we will address other factors of the big data context, such as velocity and variety, in order to consolidate our methods, and we look forward to a unified solution supporting data-intensive applications.