Advanced Analytical Theory and Methods: Clustering - Data Science and Big Data Analytics

4.2 K-means

Given a collection of objects, each with n measurable attributes, k-means [1] is an
analytical technique that, for a chosen value of k, identifies k clusters of objects
based on each object's proximity to the centers of the k groups. Each center is
computed as the arithmetic average (mean) of the n-dimensional attribute vectors of
the objects in that cluster. This section describes the algorithm to determine the
k means as well as how best to apply this technique to several use cases. Figure 4.1
illustrates three clusters of objects with two attributes. Each object in the dataset
is represented by a small dot, color-coded to match the closest large dot, which marks
the mean of its cluster.

Figure 4.1 Possible k-means clusters for k=3
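
The two ideas above (assigning each object to its closest center, and computing a center as the mean of its cluster's attribute vectors) can be illustrated with a minimal Python sketch. It is not the full algorithm, only a single assignment-and-update pass. It assumes Euclidean distance as the proximity measure, which the text so far does not specify, and the data points, starting centers, and function names are hypothetical.

```python
import math

def centroid(cluster):
    """Arithmetic mean of a non-empty cluster's n-dimensional attribute vectors."""
    n = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(n))

def nearest_center(point, centers):
    """Index of the center closest to point (Euclidean distance is an assumption)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

# Hypothetical objects with two attributes each, and k = 3 starting centers.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0), (5.0, 20.0)]
centers = [(0.0, 0.0), (10.0, 10.0), (5.0, 18.0)]

# Assign every object to its closest center.
labels = [nearest_center(p, centers) for p in points]

# Recompute each center as the mean of the objects assigned to it.
# (This sketch does not handle a center that ends up with no objects.)
new_centers = [
    centroid([p for p, label in zip(points, labels) if label == j])
    for j in range(len(centers))
]

print(labels)       # [0, 0, 1, 1, 2]
print(new_centers)  # [(1.0, 2.0), (10.0, 9.0), (5.0, 20.0)]
```

With k=3 and two attributes, this mirrors the setup in Figure 4.1: `labels` plays the role of the color coding, and `new_centers` corresponds to the large dots.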

4.2.1 Use Cases

Clustering is often used as a lead-in to classification. Once the clusters are
identified, labels can be applied to each cluster to classify each group based on its
characteristics. Classification is covered in more detail in Chapter 7, “Advanced
Analytical Theory and Methods: Classification.” Clustering is primarily an
exploratory technique to discover hidden structures in the data, possibly as a
prelude to more focused analysis or decision processes. Some specific applications
of k-means are image processing, medical applications, and customer segmentation.
