The Data Scientist may have a choice of a dozen or more attributes to use in
the clustering analysis. Whenever the data permits, it is best to reduce the
number of attributes. Too many attributes can dilute the impact of the most
important variables. Also, the use of several
similar attributes can place too much importance on one type of attribute. For
example, if five attributes related to personal wealth are included in a clustering
analysis, the wealth attributes dominate the analysis and possibly mask the
importance of other attributes, such as age.
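The dominance effect can be illustrated with a minimal sketch in Python. It assumes Euclidean distance on already standardized values, which is one common choice for clustering but is not stated in the text above. The three records, their values, and the helper names (euclidean, as_vector, nearest_to_p) are hypothetical and exist only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical standardized records (illustrative values only).
# q has similar wealth to p but a very different age;
# r has the same age as p but a different wealth.
records = {
    "p": {"wealth": 0.0, "age": 0.0},
    "q": {"wealth": 0.2, "age": 1.5},
    "r": {"wealth": 1.0, "age": 0.0},
}


def as_vector(rec, wealth_copies):
    """Repeat the wealth value to mimic several near-identical wealth attributes."""
    return [rec["wealth"]] * wealth_copies + [rec["age"]]


def nearest_to_p(wealth_copies):
    """Return which of q or r lies closer to p for a given number of wealth attributes."""
    p = as_vector(records["p"], wealth_copies)
    distances = {
        name: euclidean(p, as_vector(records[name], wealth_copies))
        for name in ("q", "r")
    }
    return min(distances, key=distances.get)


for copies in (1, 5):
    print(copies, "wealth attribute(s): nearest to p is", nearest_to_p(copies))
```

With a single wealth attribute, the record that matches p on age is the closer one. With five wealth attributes, the record that matches p on wealth becomes closer, even though nothing about the underlying people has changed.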
When dealing with the problem of too many attributes, one useful approach is to
identify any highly correlated attributes and use only one or two of the correlated
attributes in the clustering analysis. As illustrated in Figure 4.10, a scatterplot
matrix, as introduced in Chapter 3, is a useful tool to visualize the pairwise
relationships between the attributes.
Figure 4.10 Scatterplot matrix for seven attributes
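The correlation screening described above can also be done numerically. The following is a minimal sketch in Python, not the procedure used to produce the figure. It assumes the Pearson correlation coefficient as the measure of linear relationship and an arbitrary cutoff of 0.9. The data values and the function names (pearson, drop_correlated) are hypothetical; only the attribute names are borrowed from the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (a constant list would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(data, threshold=0.9):
    """Keep an attribute only if it is not highly correlated with one already kept.

    `data` maps attribute name -> list of values; `threshold` is an assumed cutoff.
    Attributes are considered in the dictionary's insertion order.
    """
    kept = []
    for name, values in data.items():
        if all(abs(pearson(values, data[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values; Attribute7 is constructed to be roughly 2 * Attribute3.
data = {
    "Attribute1": [1.0, 4.0, 2.0, 8.0, 5.0, 7.0],
    "Attribute3": [2.0, 3.0, 5.0, 7.0, 11.0, 13.0],
    "Attribute7": [4.1, 6.0, 10.2, 13.9, 22.1, 26.0],
}

print(drop_correlated(data))
```

In this sketch the first attribute of each correlated group is the one retained. In practice, which attribute to keep is a judgment call that depends on interpretability and data quality.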
The strongest relationship is between Attribute3 and Attribute7. If the value
of one of these two attributes is known, it appears that the value of the other
attribute can be determined with near certainty. Other linear
relationships are also identified in the plot. For example, consider the plot of