An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce - Data Management in Cloud, Grid and P2P Systems


Moreover, the intermediate key-value pairs output by REDUCE-1 are sorted in descending order, first by the length of the document list D_i associated with a term k, and then by the total number of words W_i of each D_i in the list. This ordering lets us apply our filtering methods, especially in the pivot document case with the range and k-NN queries of section 5, so that less data is transferred over the network. As a consequence, the candidate size is significantly decreased.
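
The ordering described above can be illustrated with a small in-memory sketch. It is not the paper's implementation: the sample data, the function name `order_reduce1_output`, and the reading that W_i orders the documents inside each list are assumptions made for illustration.

```python
# Hypothetical REDUCE-1 output: term -> list of (doc_id, total_words) pairs.
reduce1_output = {
    "cloud": [("D1", 5), ("D2", 9), ("D3", 7)],
    "grid": [("D2", 9)],
    "peer": [("D1", 5), ("D3", 7)],
}


def order_reduce1_output(postings):
    """Order terms by descending list length and each list by descending W_i."""
    ordered = []
    for term, docs in postings.items():
        # Inside a list, documents with more total words come first (assumed reading).
        docs_sorted = sorted(docs, key=lambda d: d[1], reverse=True)
        ordered.append((term, docs_sorted))
    # Terms with longer document lists come first.
    ordered.sort(key=lambda item: len(item[1]), reverse=True)
    return ordered


ordered = order_reduce1_output(reduce1_output)
for term, docs in ordered:
    print(term, docs)
```

Placing the longest lists first means that the terms generating the most candidate pairs are seen early, which is where a filter has the most to prune.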

Fig. 1. The overview scheme

The Prior Filter is applied at the MapReduce-1 operation, whilst the Query Parameter Filter is attached to the MapReduce-2 operation. The former consists of three sub-filtering methods known as Duplicate Word Filtering, Common Word Filtering, and Lonely Word Filtering. The latter is composed of another three sub-filtering methods named Range Query Filtering, Pre-pruning, and k-NN Query Filtering. These filtering methods are combined in different ways to support specific similarity search scenarios. In general, the proposed scheme can be applied to the various similarity search strategies discussed in section 5 of this paper. For simplicity, we first present how the proposed scheme works for pairwise document similarity search in sub-section 5.1, and then show how it adapts to other similarity search parameters in the remaining sub-sections.
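
As a rough illustration of how the three Prior Filter steps might fit together, the sketch below applies them to a toy collection. The semantics are assumptions, since this excerpt names the filters without defining them: Duplicate Word Filtering is taken to keep each word once per document, Common Word Filtering to drop words that occur in more than a chosen fraction of documents, and Lonely Word Filtering to drop words that occur in a single document. The function `prior_filter` and the parameter `max_df_ratio` are hypothetical.

```python
from collections import Counter


def prior_filter(docs, max_df_ratio=0.8):
    """Apply assumed versions of the three prior sub-filters.

    docs maps a document id to its list of words.
    """
    # Duplicate Word Filtering (assumed): keep each word once per document.
    unique = {doc_id: set(words) for doc_id, words in docs.items()}

    # Document frequency of every word after de-duplication.
    df = Counter(word for words in unique.values() for word in words)
    limit = max_df_ratio * len(docs)

    filtered = {}
    for doc_id, words in unique.items():
        # Common Word Filtering (assumed): drop words with df above the limit.
        # Lonely Word Filtering (assumed): drop words found in one document only.
        filtered[doc_id] = {w for w in words if 1 < df[w] <= limit}
    return filtered


docs = {
    "D1": ["a", "a", "b", "the"],
    "D2": ["a", "c", "the"],
    "D3": ["d", "the"],
    "D4": ["e", "the"],
}
filtered = prior_filter(docs)
print(filtered)
```

Under these assumptions a lonely word cannot contribute to any document pair, and a very common word contributes to almost every pair, so removing both shrinks the intermediate data without touching the words that discriminate between pairs.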

Let D_i be the i-th document of the workset, W_i be the total number of words in D_i, n be the accumulated count of occurrences of the same key, and sim(D_i, D_j) be the similarity score of a document pair. The two MapReduce operations can be summarized as follows:

MAP-1: , @

REDUCE-1: , @ , @

MAP-2: , @ @ @ ,

REDUCE-2: @ @ , , ,
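
To make the two-phase flow concrete, here is a minimal single-process sketch of a pairwise similarity job built on an inverted index. It is written under several assumptions and should not be read as the authors' exact formulation: the shuffle step is simulated with dictionaries, W_i is taken as the number of distinct words in D_i, the similarity measure is assumed to be Jaccard, computed as n / (W_i + W_j - n), and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map1(doc_id, words):
    """Emit (term, (doc_id, W_i)) for every distinct term of a document."""
    distinct = set(words)
    for term in distinct:
        yield term, (doc_id, len(distinct))


def reduce1(term, values):
    """Collect the document list of a term in a deterministic order."""
    return term, sorted(values)


def map2(term, doc_list):
    """Emit one count per document pair that shares the term."""
    for (di, wi), (dj, wj) in combinations(doc_list, 2):
        yield (di, wi, dj, wj), 1


def reduce2(pair_key, counts):
    """Sum the shared terms n and compute the assumed Jaccard score."""
    di, wi, dj, wj = pair_key
    n = sum(counts)
    return (di, dj), n / (wi + wj - n)


def pairwise_similarity(docs):
    """Run both phases in memory; dictionaries stand in for the shuffle."""
    by_term = defaultdict(list)
    for doc_id, words in docs.items():
        for term, value in map1(doc_id, words):
            by_term[term].append(value)

    by_pair = defaultdict(list)
    for term, values in by_term.items():
        _, doc_list = reduce1(term, values)
        for pair_key, one in map2(term, doc_list):
            by_pair[pair_key].append(one)

    return dict(reduce2(key, counts) for key, counts in by_pair.items())


docs = {
    "D1": ["a", "b", "c"],
    "D2": ["b", "c", "d"],
    "D3": ["x"],
}
scores = pairwise_similarity(docs)
print(scores)  # {('D1', 'D2'): 0.5}
```

Carrying W_i alongside each document identifier means the second reducer can compute the score from the pair key and the count n alone, without looking the documents up again.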
