Adaptive Affinity Propagation Clustering in MapReduce Environment - Technologies and Applications of Artificial Intelligence

Information Technology Reference

In-Depth Information

Adaptive Affinity Propagation Clustering

in MapReduce Environment

Wei-Chih Hung 1 , Yuan-Cheng Liu 1 , Yi-Leh Wu 1 ,

Cheng-Yuan Tang 2 , and Maw-Kae Hor 3

1 Department of Computer Science and Information Engineering,

National Taiwan University of Science and Technology, Taipei, Taiwan

2 Department of Information Management, Huafan University, New Taipei City, Taiwan

3 Kainan University, Taoyuan, Taiwan

ywu@csie.ntust.edu.tw

Abstract. The Affinity Propagation (AP) is a clustering algorithm based on the

concept of “message passing” between data points. Unlike most clustering algo-

rithms such as k-means, the AP does not require the number of clusters to be

determined or estimated before running the algorithm. There are implementa-

tion of AP on Hadoop, a distribute cloud environment, called the Map/Reduce

Affinity Propagation (MRAP). But the MRAP has a limitation: it is hard to

know what value of parameter “preference” can yield an optimal clustering so-

lution. The Adaptive Affinity Propagation Clustering (AAP) algorithm was

proposed to overcome this limitation to decide the preference value in AP. In

this study, we propose to combine these two methods as the Adaptive

Map/Reduce Affinity Propagation (AMRAP), which divides the clustering task

to multiple mappers and one reducer in Hadoop, and decides suitable preference

values individually for each mapper. In the experiments, we compare the clus-

tering results of the proposed AMRAP with the original MRAP method. The

experiment results support that the proposed AMRAP method outperforms the

original MRAP method in terms of accuracy.

Keywords: Affinity propagation, Map/Reduce, Hadoop, clustering algorithm.

1

Introduction

The Affinity Propagation (AP) [1] is a clustering algorithm that requires no pre-set

number of clusters K. The AP is a fast clustering algorithm especially in the case of

large number of clusters. Speed, general applicability and good performance are its

advantages. The AP works based on similarities between pairs of data points, and

simultaneously considers all the data points as the potential cluster centers, called

exemplars. There are two kinds of message exchanged between data points, and each

takes into account a different kind of competition. Messages can be combined at any

stage to decide which points are exemplars; and for the non-exemplar points, which

exemplar it belongs to. The “ responsibility” , , sent from data point i to candi-

date exemplar point k , reflects the accumulated evidence for how well-suited point k

Search WWH ::

Custom Search

Home