Database Reference
In-Depth Information
word that appears only in the pivot document D 3 . After the first MapReduce operation,
we have some terms in the list of ordered key-value pairs [[B, [D 3 @5, D 1 @3]],
[F, [D 3 @5, D 1 @3]], [E, [D 3 @5, D 2 @2]]]. At the second MapReduce operation as
illustrated in Fig. 5, we have D 1 , D 2 , and D 3 as in turn the candidate similarity pairs.
The new form similarity computation shows that the similarity scores are as follows
[[D 13 , 0.4], [D 23 , 0.2]]. Last but not least, Query Parameter Filtering is optionally used
to filter against range query and k-NN query before generating the final output.
Fig. 5. MapReduce-2 operation with the pivot document D 3
5.3
Range Query Case
From the point of view of range query, a threshold
is provided in order to find those
pairs that have their similarity score greater or equal to the threshold. In general, the
result after two MapReduce operations is filtered against
ε
to find the best fit. In the
case of the pivot document, however, we propose to filter unnecessary pairs before
their similarity computing. In other words, the final result has to satisfy the equation
below which still guarantees the candidate pairs are the super set of the final result:
ε
, || ||
3
Because ||D query || is computed from MAP-1 method, we call the new upper bound
ε
'
the product between the query threshold
and the length of the pivot document. As a
result, the equation (3) has been equivalently transformed into the equation (4) below:
ε
|| ||
| | | | 4
With
It is worth noting that intermediate key-value pairs from REDUCE-1 have the des-
cending order by the length of the list D i a term k has. Therefore, we can make the best
use of Pre-pruning at MAP-2 method to emit those pairs whose comparing documents
have the key value ||D i || greater or equal to the new threshold
ε
', which partly helps
eliminate the amount of unnecessary similarity computations.
5.4
k-NN Query Case
k-NN query looks for the k most similar pairs from the candidate set and can be seen
as a special case of range query. Having the same strategy like in the range query
Search WWH ::




Custom Search