Over the course of the past decade, many technologies have promised to help with the processing and analysis of the vast amounts of information we have [1], and most of these technologies have come up short. We know this because, as programmers focused on data, we have tried them all. Many approaches have been proprietary, resulting in vendor lock-in. Some approaches were promising but could not scale to handle large datasets, and many were hyped so heavily that they could not meet expectations or were simply not ready for prime time.
When Apache Hadoop [2] entered the scene, however, everything was different. Hadoop is an open-source framework that had already found incredible success in massively scalable commercial applications. Based on the MapReduce [3, 4] programming model, which enables us to bring the processing to data distributed across a scalable cluster of machines, we have found much success in performing complex data analysis in ways that we have not been able to in the past.
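As a rough illustration of the MapReduce model (a single-machine Python simulation of the pattern, not the authors' Hadoop code; all function names here are our own), map emits key-value pairs, a shuffle step groups values by key, and reduce aggregates each group independently:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, value) pair for every word occurrence.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

data = ["hadoop scales out", "hadoop brings processing to the data"]
print(reduce_phase(shuffle(map_phase(data))))
# {'hadoop': 2, 'scales': 1, 'out': 1, 'brings': 1, ...}
```

Because each reduce group can be processed independently, the same pattern scales out across the machines of a cluster.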
There are various methods of data analysis in the fields of data mining, pattern recognition, image processing, etc. Among the existing methods, K-Means is widely used. But clustering becomes more and more complex when the process is applied to large-scale datasets. The time complexity of the K-Means algorithm is O(NKD), where N is the number of objects, K the number of clusters, and D the number of iterations.
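To make that cost concrete, here is a plain-Python sketch of Lloyd's K-Means (illustrative only, not the paper's implementation; the init and seed parameters are our additions). The nested loops show where the O(NKD) cost comes from: each of the D iterations compares each of the N points against all K centers:

```python
import random

def kmeans(points, k, iterations=10, init=None, seed=0):
    # k and the initial centers must be chosen up front.
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):                      # D iterations
        clusters = [[] for _ in range(k)]
        for p in points:                             # N objects
            dists = [sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centers]               # K distance checks
            clusters[dists.index(min(dists))].append(p)
        for i, cluster in enumerate(clusters):       # recompute centers
            if cluster:
                centers[i] = tuple(sum(dim) / len(cluster)
                                   for dim in zip(*cluster))
    return centers, clusters
```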
But a disadvantage of the K-Means algorithm is that k must be initialized, and the result varies with the value of k. Another disadvantage is that it requires additional space to store the data; moreover, for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented. It also does not necessarily find the most optimal partition [5], and it is sensitive to the order of data input [6].
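Using the kmeans() sketch above on four corner points (a hand-picked toy example, not the paper's data), two different initializations converge to two different stable partitions, which illustrates the initialization sensitivity noted above:

```python
pts = [(0.0, 0.0), (0.0, 4.0), (10.0, 0.0), (10.0, 4.0)]
for init in ([(0.0, 0.0), (10.0, 0.0)], [(0.0, 0.0), (0.0, 4.0)]):
    centers, _ = kmeans(pts, k=2, init=init)
    print(centers)
# The first init settles on a left/right split (centers (0, 2) and (10, 2)),
# the second on a top/bottom split (centers (5, 0) and (5, 4)); both are
# stable under the update, but only the first minimizes the squared error.
```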
Hence, there is a need for an enhanced algorithm that can minimize the above disadvantages. Therefore, this paper introduces a novel and efficient technique for large datasets. In this paper, we propose a new technique that combines FCM with the canopy algorithm. Moreover, implementing FCM with canopy on a distributed computing platform yields better results.
The rest of the paper is organized as follows: Sect. 2 describes the architecture of the proposed method, Sect. 3 demonstrates the clustering process using canopies, Sect. 4 elaborates the Fuzzy C-Means algorithm for clustering the semi-clustered groups, and the results of the proposed method are discussed in Sect. 5. Finally, Sect. 6 concludes the paper.
2 Architecture of the Proposed Method
In the proposed architecture, the available data are initially fed as input to the canopy technique to find approximate clusters, and in the next step, the points are assigned to canopies. After obtaining these initial clusters, the resulting cluster groups are fed to the FCM algorithm, and the final clusters are obtained. Fig. 1 demonstrates the different steps in the proposed architecture.
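As a minimal sketch of the first stage, assuming Euclidean distance and two thresholds T1 > T2 (the paper's exact metric and threshold values are not given in this excerpt), canopy assignment produces cheap, possibly overlapping groups:

```python
import math

def canopy(points, t1, t2):
    # Requires t1 > t2. Each pass picks an arbitrary remaining point as a
    # canopy center, pulls in every point within the loose threshold t1,
    # and removes from further consideration those within the tight
    # threshold t2.
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)           # joins this canopy
            if d >= t2:
                still_remaining.append(p)   # may still seed/join others
        remaining = still_remaining
        canopies.append(members)
    return canopies
```

Because each point is compared only against canopy centers rather than against all other points, this pass is far cheaper than a full clustering run and partitions the work naturally for a distributed implementation.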
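For the second stage, here is a compact sketch of one Fuzzy C-Means update (assuming the standard FCM update equations with fuzzifier m = 2; the paper's parameter choices are not shown in this excerpt). Each point receives a membership degree in every cluster instead of a hard assignment:

```python
import math

def fcm_step(points, centers, m=2.0):
    # Update the membership matrix u[i][j]: degree of point i in cluster j.
    u = []
    for p in points:
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        row = [1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
               for dj in dists]
        u.append(row)
    # Update each center as the membership-weighted mean of all points.
    new_centers = []
    for j in range(len(centers)):
        w = [u[i][j] ** m for i in range(len(points))]
        total = sum(w)
        new_centers.append(tuple(
            sum(w[i] * points[i][d] for i in range(len(points))) / total
            for d in range(len(points[0]))))
    return u, new_centers
```

Iterating fcm_step until the centers stabilize, with the canopy centers as the initial centers, yields the final clusters described above.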