Database Reference
In-Depth Information
Researchers not only from academia but also from industry have focused on how
to tackle the issue for many years. Some try to propose novel indexes or ways to ap-
proximately but efficiently do similarity search [5][6]. Recent state-of-the-arts pay
attention to parallelism by optimizing algorithms in parallel [1] or employing MapRe-
duce [9] to improve similarity search for large scale data [4][8][12][13][14]. In this
paper, we propose an elastic scheme dealing with massive datasets. Our main contri-
butions are as follows:
1. A general approximate similarity search scheme with MapReduce is proposed
towards data scalability.
2. Collaborative refinement strategies are exploited to meet users' search needs
and reduce candidate size, which leads to eliminating unnecessary similarity
computing and turning storage overheads down.
3. The scheme is then shown how much adaptable it is to specific similarity
search including pairwise documents search, pivot document search, range
search, and k-Nearest Neighbor (k-NN) search.
4. Experiments are conducted with Apache Hadoop framework [3] and real large
datasets to indicate which can benefit from the proposed methods when large
amounts of data keep increasing.
The rest of paper is organized as follows. Section 2 introduces relevant researches
in the field of similarity search which are close to our work. Section 3 presents some
preliminaries behind our proposed similarity search methods. Next, section 4 shows
our general scheme with MapReduce while section 5 discusses about how the general
scheme adapts to the four most common similarity search strategies. Section 6 gives
our experiments before we conclude our work and its future in section 7.
2
Related Work
The importance of similarity search as well as its issues has gained much attention
and effort from researchers world-wide, especially when the data become oversized.
While the traditional exact similarity search demands full computations that has the
complexity as O(n 2 ), much work tries to reduce the search space by filtering methods
or approximately achieves similarity to improve the performance. In [11], the authors
propose State Set Index, which is interpreted as a nondeterministic finite automaton,
as a method for fast similarity search in very large string sets. Another approach com-
ing from the work in [15] utilizes positional filtering principle which is then combined
with prefix and suffix filtering. The work in [16] presents the combination between
index structure based method for prune strategy and hashing based method for fast
distance computation. Nevertheless, only k-NN query is considered. In general, these
methods are sequentially done without caring about parallel mechanism.
Still there is other work exploiting parallelism to do similarity search. Like in [1],
the authors conduct exact all-pair similarity search. Nevertheless, it requires a static
partitioning to assure that dissimilar vectors are assigned to different partitions before
computing similarity pairs. The authors in [10] introduce a MapReduce algorithm to
compute the similarity score between pairs in large document collections. The algo-
rithm consists of two MapReduce phases which first create an inverted index and then
Search WWH ::




Custom Search