Database Reference
In-Depth Information
Moreover, it is noted that intermediate key-value pairs output from REDUCE-1 have
the descending order by the length of the list D i a term k has and then by the total
words W i of D i in each list. In this way, we want to apply our filtering methods espe-
cially for the pivot document case, with range and k-NN queries in section 5, in order
not to transfer much data over the network. As a consequence, the candidate size is
significantly decreased.
Fig. 1. The overview scheme
The Prior Filter is applied at the MapReduce-1 operation whilst the Query Parame-
ter Filter is attached to the MapReduce-2 operation. The former consists of three sub-
filtering methods known as Duplicate Word Filtering, Common Word Filtering, and
Lonely Word Filtering. Meanwhile, the latter is composed of another three sub-
filtering named Range Query Filtering, Pre-pruning, and k-NN Query Filtering. These
filtering methods are alternatively combined to support specific similarity search sce-
narios. In general, the proposed scheme is not limited to be applied for various simi-
larity search strategies as discussed in section 5 of this paper. For simplicity, we
present how the proposed scheme at first works for pairwise document similarity
search in sub-section 5.1, and then we show how our scheme is effectively adapt itself
to other similarity search parameters in the remaining sub-sections.
Let D i be the i th document of the workset, W i be the total words of D i , n be the ac-
cumulated number of the same key, and sim(D i , D j ) be the similarity score between a
document pair. The two MapReduce operations can be summarized as follows:
MAP-1: , @
REDUCE-1: , @ , @
MAP-2: , @ @ @ ,
REDUCE-2: @ @ , , ,
Search WWH ::




Custom Search