An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce - Data Management in Cloud, Grid and P2P Systems

Database Reference

In-Depth Information

Researchers not only from academia but also from industry have focused on how

to tackle the issue for many years. Some try to propose novel indexes or ways to ap-

proximately but efficiently do similarity search [5][6]. Recent state-of-the-arts pay

attention to parallelism by optimizing algorithms in parallel [1] or employing MapRe-

duce [9] to improve similarity search for large scale data [4][8][12][13][14]. In this

paper, we propose an elastic scheme dealing with massive datasets. Our main contri-

butions are as follows:

1. A general approximate similarity search scheme with MapReduce is proposed

towards data scalability.

2. Collaborative refinement strategies are exploited to meet users' search needs

and reduce candidate size, which leads to eliminating unnecessary similarity

computing and turning storage overheads down.

3. The scheme is then shown how much adaptable it is to specific similarity

search including pairwise documents search, pivot document search, range

search, and k-Nearest Neighbor (k-NN) search.

4. Experiments are conducted with Apache Hadoop framework [3] and real large

datasets to indicate which can benefit from the proposed methods when large

amounts of data keep increasing.

The rest of paper is organized as follows. Section 2 introduces relevant researches

in the field of similarity search which are close to our work. Section 3 presents some

preliminaries behind our proposed similarity search methods. Next, section 4 shows

our general scheme with MapReduce while section 5 discusses about how the general

scheme adapts to the four most common similarity search strategies. Section 6 gives

our experiments before we conclude our work and its future in section 7.

2

Related Work

The importance of similarity search as well as its issues has gained much attention

and effort from researchers world-wide, especially when the data become oversized.

While the traditional exact similarity search demands full computations that has the

complexity as O(n 2 ), much work tries to reduce the search space by filtering methods

or approximately achieves similarity to improve the performance. In [11], the authors

propose State Set Index, which is interpreted as a nondeterministic finite automaton,

as a method for fast similarity search in very large string sets. Another approach com-

ing from the work in [15] utilizes positional filtering principle which is then combined

with prefix and suffix filtering. The work in [16] presents the combination between

index structure based method for prune strategy and hashing based method for fast

distance computation. Nevertheless, only k-NN query is considered. In general, these

methods are sequentially done without caring about parallel mechanism.

Still there is other work exploiting parallelism to do similarity search. Like in [1],

the authors conduct exact all-pair similarity search. Nevertheless, it requires a static

partitioning to assure that dissimilar vectors are assigned to different partitions before

computing similarity pairs. The authors in [10] introduce a MapReduce algorithm to

compute the similarity score between pairs in large document collections. The algo-

rithm consists of two MapReduce phases which first create an inverted index and then

Data Management in Cloud, Grid and P2P Systems

Search WWH ::

Custom Search

Home