Advanced Analytical Theory and Methods: Clustering - Data Science and Big Data Analytics

Summary

Clustering analysis groups similar objects based on the objects' attributes. Clustering is applied in areas such as marketing, economics, biology, and medicine. This chapter presented a detailed explanation of the k-means algorithm and its implementation in R. To use k-means properly, it is important to do the following:

• Properly scale the attribute values to prevent certain attributes from dominating the other attributes.

• Ensure that the concept of distance between the assigned values within an attribute is meaningful.

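• Choose the number of clusters, k, such that the Within Sum of Squares (WSS), the sum of squared distances from each point to its assigned centroid, is reasonably small. Because WSS keeps decreasing as k grows, look for the value of k beyond which additional clusters yield only a marginal reduction. A plot such as the example in Figure 4.5 can be helpful in this respect.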
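The first and third points in the list above can be sketched in a few lines. The chapter's own implementation is in R; the following is only a minimal Python sketch using the standard library, and the function names (`zscore_columns`, `kmeans_wss`) and the age/income records are hypothetical, made up for illustration. It rescales each attribute to mean 0 and standard deviation 1, then reports the WSS for several values of k so that the flattening of the curve can be inspected.

```python
import random
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale each attribute (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, n_init=10, max_iter=100, seed=0):
    """Run basic (Lloyd's) k-means n_init times and return the smallest WSS found."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        centroids = rng.sample(points, k)  # start from k distinct data points
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # New centroid = mean of the members; keep the old one if a cluster is empty.
            new_centroids = [
                tuple(mean(dim) for dim in zip(*members)) if members else centroids[i]
                for i, members in enumerate(clusters)
            ]
            if new_centroids == centroids:
                break
            centroids = new_centroids
        wss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        best = min(best, wss)
    return best

# Hypothetical records: (age in years, income in dollars).
raw = [
    (25, 30000), (27, 32000), (24, 31000),
    (45, 80000), (47, 82000), (44, 79000),
    (65, 40000), (66, 42000), (64, 41000),
]
scaled = zscore_columns(raw)
wss_by_k = {k: kmeans_wss(scaled, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 3))
```

Without the rescaling step, income (in the tens of thousands) would dominate age in every distance computation. Plotting `wss_by_k` against k gives the kind of curve the third bullet refers to.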

If k-means does not appear to be an appropriate clustering technique for a given dataset, then alternative techniques such as k-modes or partitioning around medoids (PAM) should be considered.
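To make the contrast with k-means concrete, the sketch below shows the building blocks these alternatives rely on when attributes are categorical: a count of mismatched attributes in place of Euclidean distance, a per-attribute mode in place of a mean (the k-modes idea), and a medoid, meaning an actual member of the cluster, as the center (the PAM idea). It is an illustrative Python sketch only, not a full implementation of either algorithm; the function names and the records are hypothetical.

```python
from collections import Counter

def mismatch_distance(a, b):
    """Simple-matching dissimilarity: number of attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(members):
    """Most frequent value per attribute; plays the role of the centroid in k-modes."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def cluster_medoid(members):
    """Member with the smallest total dissimilarity to the others, as used by PAM."""
    return min(members, key=lambda m: sum(mismatch_distance(m, o) for o in members))

# Hypothetical categorical records: (color, size, region).
cluster = [
    ("red", "small", "north"),
    ("red", "large", "north"),
    ("blue", "small", "north"),
]
center = cluster_mode(cluster)
distances = [mismatch_distance(r, center) for r in cluster]
medoid = cluster_medoid(cluster)
print(center)     # ('red', 'small', 'north')
print(distances)  # [0, 1, 1]
print(medoid)     # ('red', 'small', 'north')
```

Neither center requires averaging attribute values, which is why these techniques remain usable when a mean has no sensible interpretation.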

Once the clusters are identified, it is often useful to label them in some descriptive way. Such labels make it easier to communicate the findings of the clustering analysis, especially to upper management. In clustering, the labels are not preassigned to each object; they are assigned subjectively after the clusters have been identified. Chapter 7 considers several methods to perform the classification of objects with predetermined labels. Clustering can also be used with other analytical techniques, such as regression. Linear regression and logistic regression are covered in Chapter 6, “Advanced Analytical Theory and Methods: Regression.”
