case, we use the MAP-2 method to filter candidate pairs against the k parameter
before their similarity scores are computed. In this case, the mappers of the MAP-2
method with pre-pruning emit only the top-k intermediate key-value pairs. For an
approximate k-NN query, however, each mapper can emit a set of top pairs whose size
depends on the total number of running mappers, as given by equation (5) below:
|toppairs| = max(⌈k/m⌉, 1)    (5)

where m denotes the total number of running mappers.
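The emission rule can be sketched as follows; this is a minimal illustration, not the authors' code, and it assumes the per-mapper budget of equation (5) is max(⌈k/m⌉, 1) for k requested neighbors and m running mappers:

```python
import heapq
from math import ceil


def topk_budget(k, num_mappers):
    # Hypothetical reading of equation (5): each of the m mappers keeps
    # at most ceil(k/m) candidate pairs, and always at least one.
    return max(ceil(k / num_mappers), 1)


def map_with_prepruning(candidate_pairs, k, num_mappers):
    """Emit only the locally best pairs instead of every candidate.

    candidate_pairs: iterable of (pair_id, similarity_score) tuples.
    Returns the top `budget` pairs by score, best first.
    """
    budget = topk_budget(k, num_mappers)
    heap = []  # min-heap of the best pairs seen so far
    for pair_id, score in candidate_pairs:
        if len(heap) < budget:
            heapq.heappush(heap, (score, pair_id))
        elif score > heap[0][0]:
            # A better pair arrived: drop the current worst kept pair.
            heapq.heapreplace(heap, (score, pair_id))
    return sorted(heap, reverse=True)
```

Because every mapper emits at most this bounded number of pairs, the shuffle volume scales with k and m rather than with the number of candidate pairs.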
6 Experiments

6.1 Environment Settings
The experiments are run on a cluster of commodity machines called Alex, which
has 48 nodes, each with 8 CPU cores and either 96 or 48 GB of RAM [2]. The
stable Hadoop version 1.2.1 is used, and the DBLP dataset [7] is used for similarity
search on publication titles. In general, we leave the other Hadoop configuration
parameters at their defaults: we want to preserve the most common settings that
commodity machines initially come with, even though some parameters could be tuned to
fit the Alex cluster. The configured capacity is 5 GB per node, giving 240 GB in
total for the 48-node cluster. The number of reducers is set to 168, the
replication factor to 47, the heap size of the cluster to about 629 MB, and the
HDFS block size to 64 MB. It is worth noting that Alex also carries the overhead
of other parallel tasks running concurrently, i.e., the nodes are not used exclusively
for our experiments. In addition, they are diskless nodes; the data actually reside
on a storage area network in the background. Each benchmark is therefore run 10 times
and the average values are reported. Last but not least, each benchmark starts fresh:
data from the previous benchmark are removed before the new benchmark
starts. All experiments for one type of query are run consecutively so that their
environments are as close as possible.
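For Hadoop 1.x, the non-default settings described above would correspond to configuration entries like the following; this is only a sketch using the classic Hadoop 1.x property names, with the values taken from the text:

```xml
<!-- hdfs-site.xml (sketch; values as described above) -->
<property><name>dfs.replication</name><value>47</value></property>
<property><name>dfs.block.size</name><value>67108864</value></property><!-- 64 MB -->

<!-- mapred-site.xml -->
<property><name>mapred.reduce.tasks</name><value>168</value></property>
```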
6.2 Empirical Evaluation
Fig. 6a shows the query processing time of the pairwise similarity case for the
naïve approach, which uses MapReduce to compute all possible pairs, and for the
proposed approach. The difference between the two approaches is small while the
benchmark size stays below a certain threshold, because both are dominated by the
fixed operating cost of the whole system. As the dataset size grows significantly,
however, the query processing time of the proposed method increases far more slowly
than that of the naïve method. In other words, when the dataset size is large enough,
there is a very big gap between them: the proposed method outperforms the naïve
method by 5% to 39% as the dataset size grows from 50 MB to 500 MB. Correspondingly,
the percentage of data volume saved during the computation phases ranges from 67%
to 82%, as shown in Fig. 6b.
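The reported 5% to 39% figures are relative savings in query processing time; with hypothetical timings, they would be computed as:

```python
def improvement_pct(t_naive, t_proposed):
    # Relative saving of the proposed method over the naive one, in percent.
    return 100.0 * (t_naive - t_proposed) / t_naive


# Example with made-up timings: a run where the proposed method takes
# 61 s against the naive method's 100 s corresponds to a 39% improvement.
print(improvement_pct(100.0, 61.0))
```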