An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce - Data Management in Cloud, Grid and P2P Systems

Database Reference

In-Depth Information

and pre-pruning processes to the pivot document case. The former takes place at the

first MapReduce operation while the latter takes place at the second one.

Those terms which do not appear in the pivot document or those which are con-

tained only in the pivot document will be assessed as lonely terms, and they should be

ignored. It is possible because lonely terms do not contribute much to the final simi-

larity score. Therefore, we can approximate the final similarity score by discarding

them. When it happens, the inverted index includes terms that appear in the pivot

document. On the one hand, we reduce the number of unnecessary terms at the very

beginning to reduce overheads. On the other hand, the Jaccard similarity score be-

tween the comparing document D i and the pivot document D query can be re-innovated

as the equation below:

, || ||

2

In the equation (2), ||D query || denotes the original length of the pivot document and

||D i || represents the number of term k it contains. In other words, ∑term k yields ||D i ||.

The new form of similarity measure is approximately computed by dividing the

length of the comparing document by the length of the pivot one. In addition, this new

form is utilized by Pre-pruning at MAP-2 to soon eliminate unqualified candidate

pairs. More details of Pre-pruning are discussed in section 5.3 for range query case

and section 5.4 for k-NN query case. Consequently, those pairs compared with the

pivot document and satisfying Pre-pruning constraints are emitted by MAP-2 method.

Fig. 4. MapReduce-1 operation with the pivot document D 3

For Example. Considering another example and assuming that D 3 is the pivot docu-

ment as illustrated in Fig. 4. That means we want to search other most similarity

documents with D 3 . The whole process is still the same as in the pairwise similarity

case except for added lonely term removal, pre-pruning, and the new form of Jaccard

similarity measure. In this example, we have ||D 3 || is 5. At the mapper side, duplicate

terms B and A in D 1 are disposed by Duplicate Word Filtering while term C, in both

D 1 and D 2 , is discarded by Lonely Word Filtering-1 because it is a lonely word that

does not appear in the pivot document D 3 . Alternatively, Common Word Filtering-1

works with a pre-defined dictionary to pre-filter common words in advance. At the

reducer side, term A is removed by Common Word Filtering-2 because it is a common

word whereas term D is ignored by Lonely Term Filtering-2 because it is a lonely

Search WWH ::

Custom Search

Home