Information Technology Reference
In-Depth Information
Adaptive Affinity Propagation Clustering
in MapReduce Environment
Wei-Chih Hung 1 , Yuan-Cheng Liu 1 , Yi-Leh Wu 1 ,
Cheng-Yuan Tang 2 , and Maw-Kae Hor 3
1 Department of Computer Science and Information Engineering,
National Taiwan University of Science and Technology, Taipei, Taiwan
2 Department of Information Management, Huafan University, New Taipei City, Taiwan
3 Kainan University, Taoyuan, Taiwan
ywu@csie.ntust.edu.tw
Abstract. The Affinity Propagation (AP) is a clustering algorithm based on the
concept of “message passing” between data points. Unlike most clustering algo-
rithms such as k-means, the AP does not require the number of clusters to be
determined or estimated before running the algorithm. There are implementa-
tion of AP on Hadoop, a distribute cloud environment, called the Map/Reduce
Affinity Propagation (MRAP). But the MRAP has a limitation: it is hard to
know what value of parameter “preference” can yield an optimal clustering so-
lution. The Adaptive Affinity Propagation Clustering (AAP) algorithm was
proposed to overcome this limitation to decide the preference value in AP. In
this study, we propose to combine these two methods as the Adaptive
Map/Reduce Affinity Propagation (AMRAP), which divides the clustering task
to multiple mappers and one reducer in Hadoop, and decides suitable preference
values individually for each mapper. In the experiments, we compare the clus-
tering results of the proposed AMRAP with the original MRAP method. The
experiment results support that the proposed AMRAP method outperforms the
original MRAP method in terms of accuracy.
Keywords: Affinity propagation, Map/Reduce, Hadoop, clustering algorithm.
1
Introduction
The Affinity Propagation (AP) [1] is a clustering algorithm that requires no pre-set
number of clusters K. The AP is a fast clustering algorithm especially in the case of
large number of clusters. Speed, general applicability and good performance are its
advantages. The AP works based on similarities between pairs of data points, and
simultaneously considers all the data points as the potential cluster centers, called
exemplars. There are two kinds of message exchanged between data points, and each
takes into account a different kind of competition. Messages can be combined at any
stage to decide which points are exemplars; and for the non-exemplar points, which
exemplar it belongs to. The responsibility” , , sent from data point i to candi-
date exemplar point k , reflects the accumulated evidence for how well-suited point k
 
Search WWH ::




Custom Search