Advanced Analytical Theory and Methods: Clustering - Data Science and Big Data Analytics

4.3 Additional Algorithms

The k-means clustering method applies readily to numeric data, where the concept
of distance arises naturally. However, it may be necessary or desirable to use an
alternative clustering algorithm. As discussed at the end of the previous section,
k-means does not handle categorical data. In such cases, k-modes [3] is a commonly
used method for clustering categorical data based on the number of differences in
the respective components of the attributes. For example, if each object has four
attributes, the distance from (a, b, e, d) to (d, d, d, d) is 3, because the two
objects differ in the first three components and agree only in the fourth. In R,
the function kmodes() is implemented in the klaR package.
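
The distance used in the example above can be sketched in a few lines. This is a minimal illustration in plain Python, not the klaR implementation; the function names matching_distance and column_modes are hypothetical and chosen only for this example.

```python
from collections import Counter


def matching_distance(x, y):
    """Count the attribute positions at which two categorical records differ."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)


def column_modes(records):
    """Return the most frequent value of each attribute across the records.

    In k-modes, this vector of modes plays the role that the mean plays
    in k-means. Ties are broken arbitrarily in this sketch.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# The example from the text: three of the four components differ.
print(matching_distance(("a", "b", "e", "d"), ("d", "d", "d", "d")))  # 3

# Mode of a small, made-up group of records (illustrative data only).
group = [("a", "x"), ("a", "y"), ("b", "y")]
print(column_modes(group))  # ('a', 'y')
```

The distance simply counts mismatches, so it needs no ordering or numeric encoding of the categories.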

Because k-means and k-modes divide the entire dataset into distinct groups, both
approaches are considered partitioning methods. A third partitioning method is
known as Partitioning around Medoids (PAM) [4]. In general, a medoid is a
representative object in a set of objects. In clustering, the medoid of a cluster
is the object in that cluster that minimizes the sum of the distances from itself
to the other objects in the cluster. The advantage of using PAM is that the
"center" of each cluster is an actual object in the dataset, whereas a k-means
centroid is an average that need not coincide with any observed object. PAM is
implemented in R by the pam() function included in the cluster R package. The fpc
R package includes a function, pamk(), which calls pam() over a range of candidate
values to estimate the optimal value for k.
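
To make the definition of a medoid concrete, the following sketch finds the medoid of a single cluster. It is plain Python rather than the pam() function, it assumes Euclidean distance (PAM itself accepts other dissimilarities), and the function name medoid and the sample points are made up for illustration. It shows only the medoid computation, not the full PAM procedure of swapping medoids to improve the partition.

```python
import math


def medoid(points, distance=math.dist):
    """Return the member of `points` whose total distance to the others is smallest."""
    return min(points, key=lambda c: sum(distance(c, p) for p in points))


# Illustrative cluster with one far-away object.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0), (9.0, 9.0)]

center = medoid(cluster)
mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

print(center)  # (1.5, 2.0), an actual object in the cluster
print(mean)    # (3.375, 3.25), not an object in the cluster
```

The mean is pulled toward the distant object and matches no observation, while the medoid is always one of the original objects.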

Other clustering methods include hierarchical agglomerative clustering and
density-based clustering methods. In hierarchical agglomerative clustering, each
object is initially placed in its own cluster. The two most similar clusters are
then combined into one. This process is repeated until a single cluster, which
includes all the objects, exists. The R stats package includes the hclust()
function for performing hierarchical agglomerative clustering. In density-based
clustering methods, the clusters are identified by the concentration of points.
The fpc R package includes a function, dbscan(), to perform density-based
clustering analysis. Density-based clustering can be useful for identifying
irregularly shaped clusters.
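
The agglomerative procedure described above can be sketched as follows. This is an illustrative plain-Python version, not the hclust() implementation. It assumes Euclidean distance and single linkage (the distance between two clusters is the smallest distance between any pair of their members); the text does not specify how cluster similarity is measured, so that choice, the function names, and the sample points are all assumptions made for this example.

```python
import math


def single_linkage(c1, c2):
    """Cluster distance: the smallest distance between any two members."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerate(points):
    """Repeatedly merge the two closest clusters until one cluster remains.

    Returns the list of merges, in order, and the final all-inclusive cluster.
    """
    clusters = [[p] for p in points]  # each object starts in its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]))
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so delete the later index first
        del clusters[i]
        clusters.append(merged)
    return history, clusters[0]


# Illustrative one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
history, final = agglomerate(points)
for left, right in history:
    print(left, "+", right)
```

With n objects, the loop performs n - 1 merges. The recorded merge order is the information a dendrogram displays, and cutting it at a chosen level yields a particular number of clusters.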
