An Elastic Approximate Similarity Search
in Very Large Datasets with MapReduce
Trong Nhan Phan¹, Josef Küng¹, and Tran Khanh Dang²
¹ FAW Institute, Johannes Kepler University Linz, Austria
{nphan,jkueng}@faw.jku.at
² HCMC University of Technology, Ho Chi Minh City, Vietnam
khanh@cse.hcmut.edu.vn
Abstract. The outbreak of data brings an era of big data and more challenges than ever before to traditional similarity search, which has spread to a wide range of applications. Furthermore, processing data at such an unprecedented scale may be infeasible, or may paralyse systems through slow performance and high overheads. Dealing with this unstoppable data growth paves the way not only to the consolidation of similarity search but also to new trends in data-intensive applications. Aiming at scalability, we propose an elastic approximate similarity search that works efficiently on very large datasets. Moreover, our proposed scheme adapts itself effectively to the well-known similarity searches with pairwise documents, pivot document, range query, and k-nearest neighbour query. Last but not least, these methods, together with our filtering strategies, are implemented and verified by experiments on real large data collections in Hadoop, showing their promising effectiveness and efficiency.
Keywords: Similarity search, approximate search, very large datasets, MapReduce, Jaccard measure, Hadoop.
1 Introduction
Similarity search has been widely used and plays an essential role in supporting a variety of applications. The basic goal is to find possibly relevant results whenever a query is sent by a user. In fact, many application scenarios want to retrieve objects that look like a given one, or to find the most similar objects in databases. Typical instances include plagiarism detection, recommendation systems, and data cleaning or clustering in data mining, to name a few. Nevertheless, similarity search has encountered a barrier caused by the sheer volume of data. Such large datasets, from web crawlers, data logs, or click streams for example, may amount to hundreds of terabytes of data to be processed. In order to identify how similar a pair of objects is, quantitative methods known as similarity functions, metrics, or distance measures are applied. This means additional computation costs are incurred to derive similarity scores. In other words, given n objects, the cost of pairwise similarity computation is O(n²). Such a high cost would be a burden to the system when the number of objects never stops growing quickly, or when the objects are distributed over a large number of computing nodes.
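To make this cost concrete, the following is a minimal sketch (in Python, with hypothetical document identifiers and token sets) of the naive all-pairs computation under the Jaccard measure used in this paper; it enumerates every pair explicitly and is exactly the quadratic baseline that a scalable scheme must avoid.

from itertools import combinations

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity |A intersect B| / |A union B| of two token sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical toy collection; real inputs would be token or shingle
# sets extracted from documents.
docs = {
    "d1": {"big", "data", "search"},
    "d2": {"data", "search", "query"},
    "d3": {"graph", "mining"},
}

# Naive all-pairs comparison: n*(n-1)/2 similarity computations,
# i.e. O(n^2) in the number of documents.
for (id1, s1), (id2, s2) in combinations(docs.items(), 2):
    print(id1, id2, round(jaccard(s1, s2), 3))

For three documents this is only three comparisons, but for a million documents it is roughly 5 × 10¹¹, which is what motivates filtering strategies and MapReduce-based parallelism in the first place.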
 