Advanced Analytical Theory and Methods: Clustering - Data Science and Big Data Analytics

Figure 4.13 Clusters with rescaled attributes

With the rescaled attributes for age and height, the borders of the resulting clusters
now fall somewhere between the two earlier clustering analyses. This result is not
surprising given the magnitudes of the attributes in the previous clustering attempts.
Some practitioners also subtract the means of the attributes to center the attributes
around zero. However, this step is unnecessary because the distance formula is
sensitive only to the scale of an attribute, not its location.
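
The point about centering can be checked numerically. The following is a minimal Python sketch, not taken from the source text (which works in R). The three (age, height) points are made-up values, and the helper names center and pairwise_distances are hypothetical. It compares the Euclidean distances between all pairs of points before and after subtracting the attribute means.

```python
import math

# Made-up (age, height) observations, assumed to be already rescaled.
points = [(0.2, 1.5), (0.9, 1.1), (0.4, 1.8)]


def center(data):
    """Subtract each attribute's mean so every attribute is centered on zero."""
    n = len(data)
    means = [sum(column) / n for column in zip(*data)]
    return [tuple(v - m for v, m in zip(row, means)) for row in data]


def pairwise_distances(data):
    """Euclidean distance between every pair of points."""
    return [math.dist(a, b) for i, a in enumerate(data) for b in data[i + 1:]]


raw = pairwise_distances(points)
centered = pairwise_distances(center(points))

# Shifting every point by the same vector leaves all pairwise distances unchanged
# (up to floating-point rounding), so the cluster assignments are unaffected.
distances_match = all(math.isclose(r, c) for r, c in zip(raw, centered))
print(distances_match)
```

Because every point is shifted by the same amount, the differences between points, and therefore the distances, stay the same. Rescaling, by contrast, does change those differences.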

In many statistical analyses, it is common to transform typically skewed data with
long tails, such as income, by taking the logarithm of the data. Such a transformation
can also be applied in k-means, but the data scientist needs to be aware of what
effect this transformation will have. For example, if the logarithm of income expressed
in dollars is used, the practitioner is essentially stating that, from a clustering
perspective, $1,000 is as close to $10,000 as $10,000 is to $100,000 (in base 10, the
logarithms are 3, 4, and 5, respectively). In many cases, the skewness of the data may
be the reason to perform the clustering analysis in the first place.
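
The effect of the logarithm on distances can be illustrated with the three income values from the example. This is a small Python sketch under the assumption that a base-10 logarithm is used; the variable names are hypothetical and the code is not from the source text.

```python
import math

# Income values in dollars, taken from the example in the text.
incomes = [1_000, 10_000, 100_000]

# Assumption: base-10 logarithm. Any other base rescales both gaps equally.
log_incomes = [math.log10(x) for x in incomes]

# Gaps between consecutive values, before and after the transformation.
raw_gaps = [b - a for a, b in zip(incomes, incomes[1:])]
log_gaps = [b - a for a, b in zip(log_incomes, log_incomes[1:])]

print(raw_gaps)   # gaps in dollars differ by a factor of ten
print(log_gaps)   # gaps on the log scale are (approximately) equal
```

On the original scale the second gap is ten times the first, whereas on the log scale the two gaps are equal, so a distance-based method such as k-means treats the three incomes as evenly spaced.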

Additional Considerations

The k-means algorithm is sensitive to the starting positions of the initial centroids.
Thus, it is important to rerun the k-means analysis several times for a particular
value of k and keep the cluster results that provide the lowest overall WSS. As seen
earlier, this task is accomplished in R by using the nstart option in the kmeans()
function call.
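
The restart idea behind nstart can be sketched in plain Python. This is an illustrative sketch only: it uses a basic Lloyd-style iteration, which is not necessarily the algorithm R's kmeans() runs, and the function names kmeans_once and kmeans_restarts, the parameter n_starts, and the data points are all hypothetical. Each run starts from different randomly chosen centroids, and the run with the lowest WSS is kept.

```python
import math
import random


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run (Lloyd-style) from randomly chosen starting centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to its closest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # WSS: sum of squared distances from each point to its closest centroid.
    wss = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return wss, centroids


def kmeans_restarts(points, k, n_starts=25, seed=0):
    """Run k-means n_starts times and keep the run with the lowest WSS."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    best = min(runs, key=lambda run: run[0])
    return best, [run[0] for run in runs]


# Made-up two-dimensional data with three visible groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

(best_wss, best_centroids), all_wss = kmeans_restarts(points, k=3)
print(best_wss)
print(sorted(set(round(w, 4) for w in all_wss)))
```

If the second printed list contains more than one value, different starting centroids led to different local solutions, which is the situation that multiple restarts are meant to guard against.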

This chapter presented the use of the Euclidean distance function to assign the
points to the closest centroids. Other possible function choices include the cosine
