5 Contributions to Similarity Search
Apart from general similarity search, which requires no additional arguments,
other cases come with their own parameters for specific application domains.
In this paper, we investigate the most popular ones: pairwise similarity, the
pivot document, the k-NN query, and the range query. For each case, we show
how our proposed scheme adapts by making use of the provided parameters.
5.1 Pairwise Similarity Case
Pairwise similarity search is also known as self-join similarity search. The
input of the scheme is a set of plain-text documents, called worksets. The
mappers of the MAP-1 method process the worksets and emit intermediate
key-value pairs of the form [term_k, D_i@W_i]. These intermediate pairs are
then transferred to the reducers of the REDUCE-1 method, which produce output
key-value pairs of the form [term_k, [D_i@W_i]], also known as an inverted
index. Before the output is produced, the Prior Filter is applied to discard
duplicate terms, common terms whose inverse document frequency is 0, and
lonely terms that occur in only one document and therefore cannot contribute
to any pairwise similarity measure. Discarding these duplicate, common, and
lonely words helps reduce the volume of data to be processed. As background
computation, Duplicate Word Filtering operates on the form [term_k, D_i@W_i]
at each mapper, while Common Word Filtering and Lonely Word Filtering operate
on the form [term_k, [D_i@W_i@idf_ik]] at each reducer.
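The MAP-1 / REDUCE-1 phase and its three filters can be sketched in a few lines of single-process Python. This is only an illustrative sketch, not the scheme's actual implementation: the function names (map1, reduce1, build_inverted_index), the use of the raw term count as the weight W_i, the formula idf = log(N / df), and the sample document contents are all assumptions chosen to match the behaviour described in the text.

```python
import math
from collections import Counter, defaultdict


def map1(doc_id, text):
    """MAP-1 sketch: emit one [term, doc@weight] pair per distinct term."""
    # Duplicate Word Filtering: Counter collapses repeated terms, so each
    # term is emitted once per document.
    counts = Counter(text.split())
    for term, count in counts.items():
        # Assumption: the weight W_i is the raw term count in the document.
        yield term, (doc_id, count)


def reduce1(term, postings, n_docs):
    """REDUCE-1 sketch: apply the reducer-side filters, attach idf."""
    df = len(postings)  # number of documents containing the term
    if df == n_docs:
        return None  # Common Word Filtering: idf = log(N / N) = 0
    if df < 2:
        return None  # Lonely Word Filtering: term shared with no other document
    idf = math.log(n_docs / df)  # assumed idf formula
    return term, [(doc, weight, idf) for doc, weight in sorted(postings)]


def build_inverted_index(docs):
    """Simulate the shuffle between mappers and reducers in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map1(doc_id, text):
            grouped[term].append(posting)
    index = {}
    for term, postings in grouped.items():
        result = reduce1(term, postings, len(docs))
        if result is not None:
            index[result[0]] = result[1]
    return index


# Hypothetical document contents (not taken from Fig. 2).
docs = {"D1": "A B C B A D", "D2": "A B E F", "D3": "A C E F"}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

With these assumed documents, A is dropped as a common word, D is dropped as a lonely word, and the repeated A and B in D1 are collapsed at the mapper, so only B, C, E, and F remain in the inverted index.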
Fig. 2. MapReduce-1 operation
For example, assume that there are three documents named D_1, D_2, and D_3.
Each document contains the words shown as input in Fig. 2. The mappers of the
MAP-1 method take the input and emit intermediate key-value pairs. These pairs
are then moved to the reducers of the REDUCE-1 method, which compute the
inverse document frequency of each term. In this example, the duplicate terms
B and A in D_1, term A, whose inverse document frequency is 0.0, and term D,
which is not shared with the other documents, are discarded. The remaining
terms B, C, E, and F, whose inverse document frequencies are greater than 0.0,
are emitted as the key-value pairs of the inverted index. At the end of this
MapReduce phase, we