take pairwise similarity into account. Nevertheless, the algorithm considers only term frequencies as weights, which are accumulated into the final similarity score through their inner product; this weighting alone does not yield an accurate similarity score. Furthermore, the authors do not mention any method to reduce the number of candidate pairs, so every document in a large collection is compared with every other one, which incurs quadratic computing cost as the data size rapidly increases. In the work in [12], the authors also use MapReduce as the skeleton of their approach. The cost of the MapReduce operations, however, is high because many phases are involved: two MapReduce jobs construct the dictionary, one transforms texts into text vectors, one builds the inverted file, and two resolve the query text. Moreover, the large prefix size causes computing overhead and slows the system down when massive datasets are given.
The authors of [4] propose a two-phase MapReduce approach for the self-join problem: the first MapReduce job builds an inverted index, whereas the second calculates similarity scores from partial contributions (the general pattern is sketched below). Their approach requires overriding Hadoop's group operator and partitioning function, while our approach employs the most fundamental principle of MapReduce, which Hadoop supports natively and which has wider adoption. Last but not least, our work also aims at reducing the candidate size before partial results are computed and then accumulated in the Reduce task. We will show how we make the best use of our filtering strategies in comparison with other work.
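For illustration, the following sketch mimics that general two-job pattern in memory: the first pass builds an inverted index and the second accumulates partial contributions per document pair. It is only a simplified, single-machine simulation under our own naming assumptions, not the actual Hadoop implementation of [4] or of our approach.

from collections import defaultdict
from itertools import combinations

# Toy document collection (illustrative only).
docs = {
    "D1": ["data", "mining", "map"],
    "D2": ["data", "reduce", "map"],
    "D3": ["graph", "mining"],
}

# Job 1 (map: emit (term, doc_id); reduce: group postings by term).
inverted_index = defaultdict(list)
for doc_id, terms in docs.items():
    for term in set(terms):
        inverted_index[term].append(doc_id)

# Job 2 (map: for each term, emit a partial contribution of 1 for every
# document pair sharing that term; reduce: sum the contributions per pair).
overlap = defaultdict(int)
for term, postings in inverted_index.items():
    for d1, d2 in combinations(sorted(postings), 2):
        overlap[(d1, d2)] += 1

# The accumulated overlap counts can then be normalised, e.g. by Jaccard.
for pair, score in sorted(overlap.items()):
    print(pair, score)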
3   Preliminaries
3.1   Concepts
Given a set of documents known as worksets Ω = {D_1, D_2, D_3, …, D_n}, each document, in plain-text form, contains a set of terms D_i = {term_1, term_2, term_3, …, term_k}. The symbol [,] represents a list of elements, while the symbol [[,], [,]] indicates a list of lists. In addition, the symbol ||.|| denotes the length of a list, i.e., the total number of distinct terms in the set. The most common similarity search problem is to find the document D_j, given a document D_i as a pivot, such that D_i and D_j form the most similar pair in Ω. In general, a document D_i can share common terms with other documents in the worksets; in the scope of this paper, we define such common terms as those appearing in all worksets. Furthermore, the inverse document frequency idf_ik shows how popular a term_k of a document D_i is compared to the other worksets. It is computed by dividing the length of the worksets ||Ω|| by the number of documents that also contain term_k. Finally, we employ an inverted index as the data structure that supports fast term filtering.
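To make these definitions concrete, the short Python fragment below builds an inverted index, computes idf_ik, and extracts the common terms over a toy collection; the sample worksets and helper names are our own illustrative assumptions, not taken from the paper.

from collections import defaultdict

# Toy worksets (illustrative only).
worksets = {
    "D1": {"t1", "t2", "t3"},
    "D2": {"t2", "t3"},
    "D3": {"t3", "t4"},
}

# Inverted index: term -> documents containing it (supports term filtering).
inverted = defaultdict(set)
for doc_id, terms in worksets.items():
    for term in terms:
        inverted[term].add(doc_id)

# idf of term k, following the definition above: ||Omega|| divided by the
# number of documents that contain term k.
def idf(term):
    return len(worksets) / len(inverted[term])

# Common terms are those appearing in all worksets.
common_terms = {t for t, ds in inverted.items() if len(ds) == len(worksets)}

print(idf("t3"), idf("t1"))   # 1.0 and 3.0
print(common_terms)           # {'t3'}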
In this paper, we use the Jaccard measure, which is widely used and also adopted in [12], to estimate the similarity score of a document pair. Given two sets X and Y, the Jaccard coefficient measuring their similarity is computed as in equation (1) below:
Jaccard(X, Y) = ||X ∩ Y|| / ||X ∪ Y||                                   (1)
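As a quick illustration of equation (1), the small function below, written in Python under our own naming assumptions, computes the Jaccard coefficient of two term sets.

def jaccard(x, y):
    # ||X ∩ Y|| / ||X ∪ Y||, as in equation (1).
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)

print(jaccard({"t1", "t2", "t3"}, {"t2", "t3", "t4"}))  # 2 / 4 = 0.5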