The value of sim(X, Y) belongs to the interval [0, 1]. If two sets X and Y are similar to each other, the value of sim(X, Y) is close to 1; otherwise, it is close to 0.
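The set-based measure above can be sketched in a few lines, assuming sim(X, Y) is the Jaccard coefficient |X ∩ Y| / |X ∪ Y| (the text does not fix the exact formula, so this choice is an assumption):

```python
# A minimal sketch of the set similarity measure, assuming sim(X, Y)
# is the Jaccard coefficient |X intersect Y| / |X union Y|.
def sim(x, y):
    """Return the similarity of two sets, a value in [0, 1]."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention: two empty sets are identical
    return len(x & y) / len(x | y)
```

Identical sets score 1.0 and disjoint sets score 0.0, matching the behavior described above.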
3.2 MapReduce Paradigm
MapReduce, whose details can be found in [9], is a parallel programming paradigm that helps solve the problem of data scalability. The fundamental idea of MapReduce, which originates from functional programming languages, is centered on two basic jobs named Map and Reduce. The paradigm is deployed on commodity machines, in which one plays the role of master and the others act as workers. The master node is responsible for assigning Map and Reduce jobs to workers. Machines assigned the Map task are called mappers, whereas machines assigned the Reduce task are known as reducers. There are M mappers to execute the Map job and R reducers to execute the Reduce job. The Map job is defined by the Map function and the Reduce job is defined by the Reduce function.
A single flow of MapReduce can be briefly summarized in the following steps: (1) the input is partitioned in a distributed file system such as the Hadoop Distributed File System [3] and has the form [key1, value1]; (2) mappers perform their assigned tasks with the Map function and produce intermediate key-value pairs of the form [key2, value2]; (3) a shuffling process, via the combine method, groups these pairs by key; (4) reducers receive [key2, [value2]], apply the Reduce function, and output the final result, which might have the form [result]; and (5) the output is finally written back to the distributed file system.
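The five steps can be simulated on a single machine; the sketch below uses a hypothetical word-count job (the map and reduce functions are illustrative, and a real Hadoop deployment would distribute the work across mappers and reducers):

```python
from collections import defaultdict

# Hypothetical map/reduce functions for a word-count job, used only to
# illustrate the five steps of a MapReduce flow.
def map_fn(key1, value1):
    # Step 2: emit intermediate [key2, value2] pairs, one per word.
    for word in value1.split():
        yield word, 1

def reduce_fn(key2, values2):
    # Step 4: fold the grouped values into a single result.
    yield key2, sum(values2)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Step 1: inputs arrive as partitioned [key1, value1] records.
    # Step 3: shuffle -- group intermediate pairs by their key.
    groups = defaultdict(list)
    for key1, value1 in inputs:
        for key2, value2 in map_fn(key1, value1):
            groups[key2].append(value2)
    # Steps 4-5: reduce each [key2, [value2]] group and collect the output.
    output = {}
    for key2, values2 in groups.items():
        for k, v in reduce_fn(key2, values2):
            output[k] = v
    return output

result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")],
                       map_fn, reduce_fn)
# result == {"a": 2, "b": 2, "c": 1}
```

In a real cluster, steps 2 and 4 run in parallel on M mappers and R reducers, but the data flow is the same as in this sequential version.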
4 The Proposed Scheme
We start by introducing our simple yet efficient scheme. As illustrated in Fig. 1, the whole process is composed of two MapReduce operations: (1) the first MapReduce operation takes worksets as its input, along with a given query document as a pivot, and generates a customized inverted index; and (2) the second MapReduce operation exploits the inverted index to compute similarity pairs. Thanks to the MapReduce paradigm, each phase handles a large volume of data by processing portions of it in parallel. Nevertheless, the more redundant data we eliminate, the better the performance we get. On the one hand, the inverted index is a good way to find the overlapping sections of each pair. On the other hand, its length is inflated by useless words that contribute little to the similarity score yet appear frequently across documents, which makes the inverted index bulky. In addition, such a naïve approach wastes computation effort by generating all candidate pairs and building the full inverted index. Thus, we propose two additional filtering strategies, one for each phase, known as the Prior Filter and the Query Parameter Filter, in order to reduce the candidate size, thereby eliminating unnecessary pair-similarity computation and avoiding further space cost. In other words, pairs that are dissimilar to each other should not be computed at all during the whole process. Doing so partly reduces overheads and improves performance phase by phase.
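The two phases can be sketched as a single-machine simulation; the function names are illustrative, the similarity is assumed to be the Jaccard coefficient, and the Prior Filter and Query Parameter Filter described above are omitted for brevity:

```python
from collections import defaultdict

# Hypothetical single-machine sketch of the two MapReduce phases;
# the two filtering strategies described above are omitted.
def build_inverted_index(docs):
    # Phase 1: map each term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in set(terms):
            index[term].add(doc_id)
    return index

def similar_pairs(docs, query_id):
    # Phase 2: use the index to collect only documents that overlap
    # the query, then score each candidate (Jaccard is assumed here).
    index = build_inverted_index(docs)
    query = set(docs[query_id])
    candidates = set()
    for term in query:
        candidates |= index[term]
    candidates.discard(query_id)
    scores = {}
    for cand in candidates:
        other = set(docs[cand])
        scores[cand] = len(query & other) / len(query | other)
    return scores

docs = {"q": ["a", "b", "c"], "d1": ["a", "b"], "d2": ["x", "y"]}
scores = similar_pairs(docs, "q")
# "d2" shares no terms with "q", so the index prunes it and its
# similarity is never computed.
```

This shows why the index length matters: every frequent but uninformative term enlarges the candidate set, which is exactly what the proposed filters aim to prevent.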