Figure 4.13 Clusters with rescaled attributes
With the rescaled attributes for age and height, the borders of the resulting clusters
now fall somewhere between the two earlier clustering analyses. Such an
occurrence is not surprising based on the magnitudes of the attributes of the
previous clustering attempts. Some practitioners also subtract the means of the
attributes to center the attributes around zero. However, this step is unnecessary
because the distance formula is only sensitive to the scale of the attribute, not its
location.
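The claim about centering can be checked numerically. The following is a minimal sketch in plain Python; the (age, height) values and the helper names center and pairwise_distances are made up for illustration and do not come from the chapter's dataset.

```python
import math

# Hypothetical (age in years, height in meters) points, for illustration only.
points = [(25.0, 1.70), (40.0, 1.82), (61.0, 1.65)]

def center(data):
    """Subtract each attribute's mean so every attribute is centered on zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [tuple(x - m for x, m in zip(row, means)) for row in data]

def pairwise_distances(data):
    """Euclidean distance between every pair of points."""
    return [math.dist(a, b) for i, a in enumerate(data) for b in data[i + 1:]]

raw = pairwise_distances(points)
centered = pairwise_distances(center(points))

# Rescaling one attribute (here, age divided by 10) is a change of scale,
# not of location, so it is expected to change the distances.
rescaled = pairwise_distances([(age / 10.0, height) for age, height in points])

print("raw      :", raw)
print("centered :", centered)
print("rescaled :", rescaled)
```

Subtracting the same mean vector from every point leaves each pairwise difference unchanged, so the raw and centered distances agree up to floating-point rounding, while the rescaled distances differ.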
In many statistical analyses, it is common to transform skewed, long-tailed data,
such as income, by taking the logarithm of the data. Such a transformation
can also be applied in k-means, but the Data Scientist needs to be aware of what
effect this transformation will have. For example, if the logarithm of income expressed
in dollars is used, the practitioner is essentially stating that, from a clustering
perspective, $1,000 is as close to $10,000 as $10,000 is to $100,000, because
each step is a tenfold increase. In many
cases, the skewness of the data may be the reason to perform the clustering analysis
in the first place.
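The effect of the logarithm on distances can be seen with the three income values from the example. This is a small sketch in plain Python; base 10 is an assumption made here for readability, and any other base would give equally spaced values as well.

```python
import math

incomes = [1_000, 10_000, 100_000]            # dollars, from the example above
logged = [math.log10(x) for x in incomes]     # base 10 is an assumption

# Gaps between neighboring values, before and after the transformation.
raw_gaps = [b - a for a, b in zip(incomes, incomes[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]

print("gaps in dollars  :", raw_gaps)
print("gaps after log10 :", log_gaps)
```

In dollars the second gap is ten times the first, whereas after the transformation the two gaps are equal, which is exactly the statement the practitioner is implicitly making to the clustering algorithm.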
Additional Considerations
The k-means algorithm is sensitive to the starting positions of the initial centroids.
Thus, it is important to rerun the k-means analysis several times for a particular
value of k to ensure the cluster results provide the overall minimum WSS. As seen
earlier, this task is accomplished in R by using the nstart option in the kmeans()
function call.
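The idea of rerunning the analysis and keeping the result with the smallest WSS can be sketched as follows. This is an illustrative plain-Python version, not the R kmeans() implementation; the function names kmeans_once and kmeans_best_of, the n_starts parameter (playing a role similar to nstart), and the synthetic data are all assumptions made for the example.

```python
import math
import random

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from randomly chosen starting centroids.

    Returns (wss, centroids), where wss is the within sum of squares.
    """
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wss = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return wss, centroids

def kmeans_best_of(points, k, n_starts=25, seed=0):
    """Repeat k-means n_starts times and keep the run with the smallest WSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])

# Synthetic data: three hypothetical groups of 20 points each.
data_rng = random.Random(42)
group_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in group_centers
    for _ in range(20)
]

single_wss, _ = kmeans_once(points, 3, random.Random(1))
best_wss, best_centroids = kmeans_best_of(points, 3, n_starts=25, seed=1)
print("WSS from a single start  :", single_wss)
print("smallest WSS of 25 starts:", best_wss)
```

Because the first of the 25 starts uses the same seed as the single run, the best-of result can never have a larger WSS than the single run. Repeated starts lower the risk of settling on a poor local minimum, but they do not guarantee that the overall minimum is found.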
This chapter presented the use of the Euclidean distance function to assign the
points to the closest centroids. Other possible function choices include the cosine