The value of sim(X, Y) belongs to the interval [0, 1]. If two sets X and Y are similar to each other, the value of sim(X, Y) is close to 1; otherwise, it is close to 0.
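The set-based measure above can be sketched in a few lines, assuming sim(X, Y) is the Jaccard coefficient |X ∩ Y| / |X ∪ Y| (the text does not fix the exact formula, so this choice is an assumption):

```python
# A minimal sketch of the set similarity measure, assuming sim(X, Y)
# is the Jaccard coefficient |X intersect Y| / |X union Y|.
def sim(x, y):
    """Return the similarity of two sets, a value in [0, 1]."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention: two empty sets are identical
    return len(x & y) / len(x | y)
```

Identical sets score 1.0 and disjoint sets score 0.0, matching the behavior described above.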
3.2 MapReduce Paradigm
MapReduce, whose details can be found in [9], is a parallel programming paradigm that helps solve the problem of data scalability. The fundamental idea of MapReduce, which originates from functional programming languages, is centered on two basic jobs named Map and Reduce. The paradigm is deployed on commodity machines, in which one plays the role of master and the others act as workers. The master node is responsible for assigning Map and Reduce jobs to workers. Machines assigned the Map task are called mappers, whereas machines assigned the Reduce task are known as reducers. There are M mappers to execute the Map job and R reducers to execute the Reduce job. The Map job is defined by the Map function and the Reduce job is defined by the Reduce function.
A single flow of MapReduce can be briefly summarized in the following steps: (1) the input is partitioned in a distributed file system such as the Hadoop Distributed File System [3] and has the form [key1, value1]; (2) mappers perform their assigned tasks with the Map function and produce intermediate key-value pairs of the form [key2, value2]; (3) a shuffling process, via the combine method, groups these pairs by key; (4) reducers receive [key2, [value2]], apply the Reduce function, and output the final result, which might have the form [result]; and (5) the output is finally written back to the distributed file system.
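The five steps can be simulated on a single machine; the sketch below uses a hypothetical word-count job (the map and reduce functions are illustrative, and a real Hadoop deployment would distribute the work across mappers and reducers):

```python
from collections import defaultdict

# Hypothetical map/reduce functions for a word-count job, used only to
# illustrate the five steps of a MapReduce flow.
def map_fn(key1, value1):
    # Step 2: emit intermediate [key2, value2] pairs, one per word.
    for word in value1.split():
        yield word, 1

def reduce_fn(key2, values2):
    # Step 4: fold the grouped values into a single result.
    yield key2, sum(values2)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Step 1: inputs arrive as partitioned [key1, value1] records.
    # Step 3: shuffle -- group intermediate pairs by their key.
    groups = defaultdict(list)
    for key1, value1 in inputs:
        for key2, value2 in map_fn(key1, value1):
            groups[key2].append(value2)
    # Steps 4-5: reduce each [key2, [value2]] group and collect the output.
    output = {}
    for key2, values2 in groups.items():
        for k, v in reduce_fn(key2, values2):
            output[k] = v
    return output

result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")],
                       map_fn, reduce_fn)
# result == {"a": 2, "b": 2, "c": 1}
```

In a real cluster, steps 2 and 4 run in parallel on M mappers and R reducers, but the data flow is the same as in this sequential version.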
4 The Proposed Scheme
We start by introducing our simple yet efficient scheme. As illustrated in Fig. 1, the whole process is composed of two MapReduce operations: (1) the first MapReduce operation takes worksets as its input, along with a given query document as a pivot, and generates a customized inverted index; and (2) the second MapReduce operation exploits the inverted index to compute similarity pairs. Thanks to the MapReduce paradigm, each phase handles a large volume of data by processing portions of it in parallel. Nevertheless, the more redundant data we eliminate, the better the performance we get. On the one hand, the inverted index is a good way to find the overlapping sections of each pair. On the other hand, its length is inflated by useless words that contribute little to the similarity score yet appear frequently across documents, which makes the inverted index bulky. In addition, such a naïve approach wastes computation effort by generating all candidate pairs and building the full inverted index. Thus, we propose two additional filtering strategies, one for each phase, known as the Prior Filter and the Query Parameter Filter, in order to reduce the candidate size, thereby eliminating unnecessary pair-similarity computation and avoiding further space cost. In other words, pairs that are dissimilar to each other should not be computed at all during the whole process. Doing so partly reduces overheads and improves performance phase by phase.
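The two phases can be sketched as a single-machine simulation; the function names are illustrative, the similarity is assumed to be the Jaccard coefficient, and the Prior Filter and Query Parameter Filter described above are omitted for brevity:

```python
from collections import defaultdict

# Hypothetical single-machine sketch of the two MapReduce phases;
# the two filtering strategies described above are omitted.
def build_inverted_index(docs):
    # Phase 1: map each term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in set(terms):
            index[term].add(doc_id)
    return index

def similar_pairs(docs, query_id):
    # Phase 2: use the index to collect only documents that overlap
    # the query, then score each candidate (Jaccard is assumed here).
    index = build_inverted_index(docs)
    query = set(docs[query_id])
    candidates = set()
    for term in query:
        candidates |= index[term]
    candidates.discard(query_id)
    scores = {}
    for cand in candidates:
        other = set(docs[cand])
        scores[cand] = len(query & other) / len(query | other)
    return scores

docs = {"q": ["a", "b", "c"], "d1": ["a", "b"], "d2": ["x", "y"]}
scores = similar_pairs(docs, "q")
# "d2" shares no terms with "q", so the index prunes it and its
# similarity is never computed.
```

This shows why the index length matters: every frequent but uninformative term enlarges the candidate set, which is exactly what the proposed filters aim to prevent.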