An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce - Data Management in Cloud, Grid and P2P Systems

Database Reference

In-Depth Information

finally have the inverted index of the list as following [[B, [D 3 @5, D 1 @4]], [F,

[D 3 @5, D 1 @4]], [E, [D 3 @5, D 2 @3]], [C, [D 1 @4, D 2 @3]]].

After the inverted index is derived from the first MapReduce operation, the second

one will be conducted to compute the similarity score. The inverted index goes

through the mappers from MAP-2 method which will emit intermediate key-value

pairs of the form [D i , [D ij @W i @W j , n]] after shuffling. It is worth noting that the

COMBINE function is additionally used to pre-group the intermediate key-value pairs

so that the amount of data transferred to the reducers is further reduced. Then the

intermediate key-value pairs are retrieved by the reducers from REDUCE-2 method

and the final key-value pairs with their similarity scores of the form [D i D j , sim(D i ,

D j )] will be computed and ranked according to the scores. Finally, the result can be

filtered against query parameters like the threshold similarity

ε

or the top k-similarity

pairs to meet users' search requirements.

Fig. 3. MapReduce-2 operation

For Example. Fig. 3 indicates the next phase of the previous example. The output of

the first MapReduce operation will be considered as the input for the second one. In

other words, the inverted index [[B, [D 3 @5, D 1 @4]], [F, [D 3 @5, D 1 @4]], [E, [D 3 @5,

D 2 @3]], [C, [D 1 @4, D 2 @3]]] is processed by the mappers from MAP-2 method to

produce the independent intermediate key-value pairs [D 31 @5@4, 1], [D 31 @5@4, 1],

[D 32 @5@3, 1], and [D 12 @4@3, 1]. The combine function groups these pairs accord-

ing to the key value, and we have the ordered shorter list [[D 31 @5@4, 2], [D 32 @5@3,

1], [D 12 @4@3, 1]]. Then, the reducers from REDUCE-2 method will calculate the

similarity score between the pairs. Finally, we have the ordered result list [[D 31 ,

0.2857], [D 12 , 0.1666], [D 32 , 0.1429]] which shows the most similarity pair is D 1 and

D 3 with the score 0.2857, the pair D 1 and D 2 is with the score 0.1666, and the pair D 2

and D 3 is with the score 0.1429.

5.2

Pivot Document Case

Normally, the query document is issued in order to search other most similarity ones

from the database. For instance, given a sample document, we would like to search

which documents are the most similar with the provided document. This scenario is

so common to most of search applications. The query document is then called the

pivot document because we use it as the anchor to compare with others. Once the

pivot document is known, we are able to employ it to improve the performance. In

this case, the whole follows the general scheme. Nevertheless, we flexibly use Prior

Filter between MAP and REDUCE tasks and additionally attach lonely term removal

Search WWH ::

Custom Search

Home