Database Reference
In-Depth Information
finally have the inverted index of the list as following [[B, [D 3 @5, D 1 @4]], [F,
[D 3 @5, D 1 @4]], [E, [D 3 @5, D 2 @3]], [C, [D 1 @4, D 2 @3]]].
After the inverted index is derived from the first MapReduce operation, the second
one will be conducted to compute the similarity score. The inverted index goes
through the mappers from MAP-2 method which will emit intermediate key-value
pairs of the form [D i , [D ij @W i @W j , n]] after shuffling. It is worth noting that the
COMBINE function is additionally used to pre-group the intermediate key-value pairs
so that the amount of data transferred to the reducers is further reduced. Then the
intermediate key-value pairs are retrieved by the reducers from REDUCE-2 method
and the final key-value pairs with their similarity scores of the form [D i D j , sim(D i ,
D j )] will be computed and ranked according to the scores. Finally, the result can be
filtered against query parameters like the threshold similarity
ε
or the top k-similarity
pairs to meet users' search requirements.
Fig. 3. MapReduce-2 operation
For Example. Fig. 3 indicates the next phase of the previous example. The output of
the first MapReduce operation will be considered as the input for the second one. In
other words, the inverted index [[B, [D 3 @5, D 1 @4]], [F, [D 3 @5, D 1 @4]], [E, [D 3 @5,
D 2 @3]], [C, [D 1 @4, D 2 @3]]] is processed by the mappers from MAP-2 method to
produce the independent intermediate key-value pairs [D 31 @5@4, 1], [D 31 @5@4, 1],
[D 32 @5@3, 1], and [D 12 @4@3, 1]. The combine function groups these pairs accord-
ing to the key value, and we have the ordered shorter list [[D 31 @5@4, 2], [D 32 @5@3,
1], [D 12 @4@3, 1]]. Then, the reducers from REDUCE-2 method will calculate the
similarity score between the pairs. Finally, we have the ordered result list [[D 31 ,
0.2857], [D 12 , 0.1666], [D 32 , 0.1429]] which shows the most similarity pair is D 1 and
D 3 with the score 0.2857, the pair D 1 and D 2 is with the score 0.1666, and the pair D 2
and D 3 is with the score 0.1429.
5.2
Pivot Document Case
Normally, the query document is issued in order to search other most similarity ones
from the database. For instance, given a sample document, we would like to search
which documents are the most similar with the provided document. This scenario is
so common to most of search applications. The query document is then called the
pivot document because we use it as the anchor to compare with others. Once the
pivot document is known, we are able to employ it to improve the performance. In
this case, the whole follows the general scheme. Nevertheless, we flexibly use Prior
Filter between MAP and REDUCE tasks and additionally attach lonely term removal
Search WWH ::




Custom Search