take pairwise similarity into account. Nevertheless, the algorithm considers only term frequencies as weights, which are accumulated into the final similarity score through their inner product; this weighting alone does not yield an accurate similarity score. Furthermore, the authors do not mention any method to reduce the number of candidate pairs, so every document in a large collection is compared with every other one, which incurs quadratic computing cost as the data size rapidly increases. In the work in [12], the authors also use MapReduce as the skeleton of their approach. The cost of the MapReduce operations, however, is high because many phases are involved: two MapReduce jobs construct the dictionary, one transforms texts into text vectors, one builds the inverted file, and two resolve the query text. Moreover, the large prefix size causes computing overhead and slows the system down when massive datasets are given.
The authors of [4] propose a two-phase MapReduce approach for the self-join problem: the first MapReduce job builds an inverted index, whereas the second calculates similarity scores from partial contributions (the general pattern is sketched below). Their approach requires overriding Hadoop's group operator and partitioning function, while our approach employs the most fundamental principle of MapReduce, which Hadoop supports natively and which has wider adoption. Last but not least, our work also aims at reducing the candidate size before partial results are computed and then accumulated in the Reduce task. We will show how we make the best use of our filtering strategies in comparison with other work.
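For illustration, the following sketch mimics that general two-job pattern in memory: the first pass builds an inverted index and the second accumulates partial contributions per document pair. It is only a simplified, single-machine simulation under our own naming assumptions, not the actual Hadoop implementation of [4] or of our approach.

from collections import defaultdict
from itertools import combinations

# Toy document collection (illustrative only).
docs = {
    "D1": ["data", "mining", "map"],
    "D2": ["data", "reduce", "map"],
    "D3": ["graph", "mining"],
}

# Job 1 (map: emit (term, doc_id); reduce: group postings by term).
inverted_index = defaultdict(list)
for doc_id, terms in docs.items():
    for term in set(terms):
        inverted_index[term].append(doc_id)

# Job 2 (map: for each term, emit a partial contribution of 1 for every
# document pair sharing that term; reduce: sum the contributions per pair).
overlap = defaultdict(int)
for term, postings in inverted_index.items():
    for d1, d2 in combinations(sorted(postings), 2):
        overlap[(d1, d2)] += 1

# The accumulated overlap counts can then be normalised, e.g. by Jaccard.
for pair, score in sorted(overlap.items()):
    print(pair, score)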
3   Preliminaries
3.1   Concepts
Given a set of documents known as worksets Ω = {D_1, D_2, D_3, …, D_n}, each document, in plain-text form, contains a set of terms D_i = {term_1, term_2, term_3, …, term_k}. The symbol [,] represents a list of elements, while the symbol [[,], [,]] indicates a list of lists. In addition, the symbol ||.|| denotes the length of a list, i.e., the total number of distinct terms in the set. The most common similarity search problem is to find the document D_j, given a document D_i as a pivot, such that D_i and D_j form the most similar pair in Ω. In general, a document D_i can share common terms with other documents in the worksets; in the scope of this paper, we define such common terms as those appearing in all worksets. Furthermore, the inverse document frequency idf_ik shows how popular a term_k of a document D_i is compared to the other worksets. It is computed by dividing the length of the worksets ||Ω|| by the number of documents that also contain term_k. Finally, we employ an inverted index as the data structure that supports fast term filtering.
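To make these definitions concrete, the short Python fragment below builds an inverted index, computes idf_ik, and extracts the common terms over a toy collection; the sample worksets and helper names are our own illustrative assumptions, not taken from the paper.

from collections import defaultdict

# Toy worksets (illustrative only).
worksets = {
    "D1": {"t1", "t2", "t3"},
    "D2": {"t2", "t3"},
    "D3": {"t3", "t4"},
}

# Inverted index: term -> documents containing it (supports term filtering).
inverted = defaultdict(set)
for doc_id, terms in worksets.items():
    for term in terms:
        inverted[term].add(doc_id)

# idf of term k, following the definition above: ||Omega|| divided by the
# number of documents that contain term k.
def idf(term):
    return len(worksets) / len(inverted[term])

# Common terms are those appearing in all worksets.
common_terms = {t for t, ds in inverted.items() if len(ds) == len(worksets)}

print(idf("t3"), idf("t1"))   # 1.0 and 3.0
print(common_terms)           # {'t3'}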
In this paper, we use the Jaccard measure, which is widely used and also adopted in [12], to estimate the similarity score of a document pair. Given two sets X and Y, the Jaccard coefficient measuring their similarity is computed as in equation (1) below:
Jaccard(X, Y) = ||X ∩ Y|| / ||X ∪ Y||                                   (1)
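As a quick illustration of equation (1), the small function below, written in Python under our own naming assumptions, computes the Jaccard coefficient of two term sets.

def jaccard(x, y):
    # ||X ∩ Y|| / ||X ∪ Y||, as in equation (1).
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)

print(jaccard({"t1", "t2", "t3"}, {"t2", "t3", "t4"}))  # 2 / 4 = 0.5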