An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce - Data Management in Cloud, Grid and P2P Systems


5 Contributions to Similarity Search

Apart from general similarity search, which requires no additional arguments, other cases usually come with their own parameters for specific application domains. In this paper, we investigate the most popular ones: pairwise similarity, the pivot document, the k-NN query, and the range query. For these cases, we show how our proposed scheme adapts by using the provided parameters.

5.1 Pairwise Similarity Case

Pairwise similarity search is also known as self-join similarity search. Plain-text documents, the so-called worksets, are taken as the input of the scheme. The mappers of the MAP-1 method process the worksets and emit intermediate key-value pairs of the form [term_k, D_i@W_i]. These intermediate pairs are then transferred to the reducers of the REDUCE-1 method, which produce output key-value pairs of the form [term_k, [D_i@W_i]], also known as an inverted index. Before the output is produced, the Prior Filter discards duplicate terms, common terms whose inverse document frequency is 0, and lonely terms, which are not shared with any other document and therefore cannot contribute to the pairwise similarity measures. Discarding these duplicate, common, and lonely words helps reduce the volume of data to be processed. As background computation, Duplicate Word Filtering works on the form [term_k, D_i@W_i] at each mapper, while Common Word Filtering and Lonely Word Filtering work on the form [term_k, [D_i@W_i@idf_ik]] at each reducer.
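The MapReduce-1 phase described above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (map1, reduce1, build_inverted_index) are hypothetical, W_i is assumed to be the number of occurrences of the term in D_i, and the idf is assumed to be log(N / df), which is 0 exactly when a term occurs in every document. The three sample documents are made up for the sketch and are not taken from Fig. 2.

```python
import math
from collections import defaultdict


def map1(doc_id, text):
    """MAP-1 sketch: emit one (term, (doc_id, weight)) pair per distinct term."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    # Duplicate Word Filtering: repeated words collapse into a single pair.
    # Assumption: the weight W is the occurrence count of the term.
    for term, weight in counts.items():
        yield term, (doc_id, weight)


def reduce1(term, postings, n_docs):
    """REDUCE-1 sketch: apply the reducer-side filters and attach the idf."""
    df = len(postings)
    if df < 2:
        return None  # Lonely Word Filtering: term occurs in one document only
    if df == n_docs:
        return None  # Common Word Filtering: idf would be log(1) = 0
    idf = math.log(n_docs / df)  # assumed idf definition
    return term, [(doc_id, weight, idf) for doc_id, weight in postings]


def build_inverted_index(docs):
    """Run the map step, group by term (the shuffle), then run the reduce step."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map1(doc_id, text):
            grouped[term].append(posting)
    index = {}
    for term, postings in grouped.items():
        result = reduce1(term, postings, len(docs))
        if result is not None:
            index[result[0]] = result[1]
    return index


# Hypothetical worksets, chosen only to exercise all three filters.
docs = {"D1": "A B A B C", "D2": "A B E F", "D3": "A C D E F"}
index = build_inverted_index(docs)
print(sorted(index))  # ['B', 'C', 'E', 'F']
```

With these sample worksets, A is dropped as a common word, D as a lonely word, and the repeated A and B in D1 each yield a single pair. In a real MapReduce job the grouping step would be performed by the framework's shuffle rather than by an in-memory dictionary.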

Fig. 2. MapReduce-1 operation

For Example. Assume that there are three documents named D_1, D_2, and D_3, each containing the words shown as input in Fig. 2. The mappers of the MAP-1 method take this input and emit intermediate key-value pairs, which are then moved to the reducers of the REDUCE-1 method to compute the inverse document frequency of each term. In this example, the duplicate occurrences of terms B and A in D_1, term A, whose inverse document frequency is 0.0, and term D, which is not shared with the other documents, are discarded. The remaining terms B, C, E, and F, whose inverse document frequencies are greater than 0.0, are emitted as the key-value pairs of the inverted index. At the end of this MapReduce phase, we
