An Elastic Approximate Similarity Search
in Very Large Datasets with MapReduce
Trong Nhan Phan¹, Josef Küng¹, and Tran Khanh Dang²
¹ FAW Institute, Johannes Kepler University Linz, Austria
{nphan,jkueng}@faw.jku.at
² HCMC University of Technology, Ho Chi Minh City, Vietnam
khanh@cse.hcmut.edu.vn
Abstract. The outbreak of data brings an era of big data and more challenges than ever before to traditional similarity search, which has spread to a wide range of applications. Furthermore, processing data at such an unprecedented scale may be infeasible, or may paralyse systems through slow performance and high overheads. Dealing with this unstoppable data growth paves the way not only to the consolidation of similarity search but also to new trends in data-intensive applications. Aiming at scalability, we propose an elastic approximate similarity search that works efficiently on very large datasets. Moreover, our proposed scheme adapts itself effectively to the well-known similarity searches with pairwise documents, pivot document, range query, and k-nearest neighbour query. Last but not least, these methods, together with our filtering strategies, are implemented and verified by experiments on real large data collections in Hadoop, showing their promising effectiveness and efficiency.
Keywords: Similarity search, approximate search, very large datasets, MapReduce, Jaccard measure, Hadoop.
1 Introduction
Similarity search has been widely used and plays an essential role in supporting a variety of applications. The basic goal is to find possibly relevant results whenever a query is sent by a user. In fact, many application scenarios want to retrieve objects that look like a given one, or to find the most similar objects in databases. Typical instances include plagiarism detection, recommendation systems, and data cleaning or clustering in data mining, to name a few. Nevertheless, similarity search has encountered a barrier caused by the sheer volume of data. Such large datasets, from web crawlers, data logs, or click streams for example, may amount to hundreds of terabytes of data to be processed. In order to identify how similar a pair of objects is, quantitative methods known as similarity functions, metrics, or distance measures are applied. This means additional computation costs are incurred to derive similarity scores. In other words, given n objects, the cost of pairwise similarity computation is O(n²). Such a high cost would be a burden to the system when the number of objects never stops growing quickly, or when the objects are distributed over a large number of computing nodes.
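To make this cost concrete, the following is a minimal sketch (in Python, with hypothetical document identifiers and token sets) of the naive all-pairs computation under the Jaccard measure used in this paper; it enumerates every pair explicitly and is exactly the quadratic baseline that a scalable scheme must avoid.

from itertools import combinations

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity |A intersect B| / |A union B| of two token sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical toy collection; real inputs would be token or shingle
# sets extracted from documents.
docs = {
    "d1": {"big", "data", "search"},
    "d2": {"data", "search", "query"},
    "d3": {"graph", "mining"},
}

# Naive all-pairs comparison: n*(n-1)/2 similarity computations,
# i.e. O(n^2) in the number of documents.
for (id1, s1), (id2, s2) in combinations(docs.items(), 2):
    print(id1, id2, round(jaccard(s1, s2), 3))

For three documents this is only three comparisons, but for a million documents it is roughly 5 × 10¹¹, which is what motivates filtering strategies and MapReduce-based parallelism in the first place.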
 