Fig. 6. Pairwise similarity case, naïve approach vs. the proposed approach; (a) query processing time; and (b) saved data volume
For the pivot document case shown in Fig. 7, the dataset size grows from 900 MB to 4500 MB, nearly reaching the maximum capacity of each node. In Fig. 7a, there is little difference in performance among the all-pairs search, the range query with its threshold set to 90% similarity, and the 100-NN query, all of which are based on the proposed methods. In addition, Fig. 7b shows the percentage of saved data volume, which increases further as more query parameters are given.
Fig. 7. Pivot document, range query, and k-NN query cases; (a) query processing time; and (b) saved data volume
Moreover, two lessons emerge from the experiments: MapReduce operations should not be too complex, owing to limited memory, and reducing the candidate size is essential because it decreases the volume of data written to the HDFS file system during MapReduce operations, which in turn yields high performance.
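The candidate-reduction principle described here is commonly realized with prefix filtering in set-similarity search. The sketch below is a minimal, single-machine illustration in Python, not the paper's actual MapReduce filters; all names and the choice of Jaccard similarity are our assumptions. It prunes document pairs that cannot reach a similarity threshold before any full comparison is performed:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_length(size, t):
    # For Jaccard threshold t, two sets can only be similar if they
    # share a token within the first size - ceil(t * size) + 1 tokens
    # of a fixed global token order (the "prefix filter").
    return size - math.ceil(t * size) + 1

def candidate_pairs(docs, t):
    """Generate candidate pairs via prefix filtering.

    docs: dict mapping doc id -> token list; t: Jaccard threshold.
    """
    sorted_docs = {d: sorted(set(toks)) for d, toks in docs.items()}
    index = {}   # token -> doc ids whose prefix contains that token
    cands = set()
    for d, toks in sorted_docs.items():
        for tok in toks[:prefix_length(len(toks), t)]:
            for other in index.get(tok, []):
                cands.add(tuple(sorted((d, other))))
            index.setdefault(tok, []).append(d)
    return cands

def similar_pairs(docs, t):
    """Verify only the surviving candidates, never all pairs."""
    return {(a, b) for a, b in candidate_pairs(docs, t)
            if jaccard(docs[a], docs[b]) >= t}
```

For example, with three documents {d1: [a, b, c], d2: [a, b, d], d3: [x, y, z]} and t = 0.5, only the pair (d1, d2) survives the prefix filter, so a single verification replaces the three comparisons of a naïve all-pairs scan. It is exactly this shrinking of intermediate results that, in a distributed setting, cuts the volume of data written between MapReduce stages.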
7 Conclusion and Future Work
In this paper, we propose an elastic approximate similarity search with MapReduce that primarily addresses scalability. We show how our search scheme is specifically tailored to the four most popular similarity search scenarios: pairwise document similarity, pivot document search, range query, and k-NN query. In addition, our strategic filtering methods, which promote the scalability of MapReduce, reduce the size of the candidate set and eliminate unnecessary computation as well as space overhead. Moreover, we conduct experiments on large real-world datasets with the Hadoop framework to verify these methods.
For future work, we will model worksets as distinct n-grams instead of terms and extend our methods to other metrics for greater efficiency. We also plan to generalize our approach to the incremental case, where data arrive on the fly. Furthermore, we will address other factors of the big data context, such as velocity and variety, in order to consolidate our methods, and we look forward to a unified solution supporting data-intensive applications.