Database Reference
In-Depth Information
and pre-pruning processes to the pivot document case. The former takes place at the
first MapReduce operation while the latter takes place at the second one.
Those terms which do not appear in the pivot document or those which are con-
tained only in the pivot document will be assessed as lonely terms, and they should be
ignored. It is possible because lonely terms do not contribute much to the final simi-
larity score. Therefore, we can approximate the final similarity score by discarding
them. When it happens, the inverted index includes terms that appear in the pivot
document. On the one hand, we reduce the number of unnecessary terms at the very
beginning to reduce overheads. On the other hand, the Jaccard similarity score be-
tween the comparing document D i and the pivot document D query can be re-innovated
as the equation below:
, || ||
2
In the equation (2), ||D query || denotes the original length of the pivot document and
||D i || represents the number of term k it contains. In other words, ∑term k yields ||D i ||.
The new form of similarity measure is approximately computed by dividing the
length of the comparing document by the length of the pivot one. In addition, this new
form is utilized by Pre-pruning at MAP-2 to soon eliminate unqualified candidate
pairs. More details of Pre-pruning are discussed in section 5.3 for range query case
and section 5.4 for k-NN query case. Consequently, those pairs compared with the
pivot document and satisfying Pre-pruning constraints are emitted by MAP-2 method.
Fig. 4. MapReduce-1 operation with the pivot document D 3
For Example. Considering another example and assuming that D 3 is the pivot docu-
ment as illustrated in Fig. 4. That means we want to search other most similarity
documents with D 3 . The whole process is still the same as in the pairwise similarity
case except for added lonely term removal, pre-pruning, and the new form of Jaccard
similarity measure. In this example, we have ||D 3 || is 5. At the mapper side, duplicate
terms B and A in D 1 are disposed by Duplicate Word Filtering while term C, in both
D 1 and D 2 , is discarded by Lonely Word Filtering-1 because it is a lonely word that
does not appear in the pivot document D 3 . Alternatively, Common Word Filtering-1
works with a pre-defined dictionary to pre-filter common words in advance. At the
reducer side, term A is removed by Common Word Filtering-2 because it is a common
word whereas term D is ignored by Lonely Term Filtering-2 because it is a lonely
Search WWH ::




Custom Search