Frequently the number of partitions must be defined beforehand. Here we adopt
the approach of partitional clustering, building on the emergent self-organizing
feature map (ESOM). The structures of the U-matrix define the clusters: data points
belong to the same cluster when their projections (bestmatches) lie in a common valley.
The neurons of an ESOM can also be clustered using the clustering algorithm
U*C, which is based on grid projections and makes use of distance and density
information (Ultsch and Herrmann 2006). In our case, this approach leads to nine
U-matrix clusters (UC).
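The idea of valleys and ridges on the U-matrix can be illustrated with a minimal sketch. It assumes a planar (non-toroidal) rectangular grid, a four-neuron neighbourhood, and a U-height defined as the mean Euclidean distance between a neuron's weight vector and those of its grid neighbours. The function name u_matrix and the toy weights are hypothetical and are not taken from the study described here.

```python
from math import dist

def u_matrix(weights):
    """Return U-heights for a grid of weight vectors (list of rows of tuples).

    Assumption: planar grid, 4-neighbourhood, mean distance to neighbours.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the grid neighbours that lie inside the map.
            neighbours = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if neighbours:  # a 1x1 grid has no neighbours; height stays 0.0
                total = sum(dist(weights[r][c], weights[nr][nc])
                            for nr, nc in neighbours)
                heights[r][c] = total / len(neighbours)
    return heights

# Hypothetical 1x4 map: two neurons near 0 and two near 5.
toy_weights = [[(0.0,), (0.1,), (5.0,), (5.1,)]]
toy_heights = u_matrix(toy_weights)
```

In this toy map the two outer neurons receive low U-heights (valleys), whereas the two inner neurons, which border the gap between the groups, receive high U-heights (a ridge). Bestmatches that fall into the same valley would be assigned to the same cluster.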
Clustering methods partition the data into clusters. The cluster structures depend
critically, first, on the definition of a meaningful measure of distance (see above)
and, second, on the details of the clustering algorithm. If a known pre-classification
is at hand, then this may be used to evaluate the clustering. However, in most
real knowledge discovery cases, no such pre-classification is given. The question
arises as to which form of clustering is optimal. For the purposes of knowledge
discovery, the quality of any data clustering is determined by whether the resulting
classes offer some useful interpretation; in particular, whether these data classes
reveal unsuspected structures and correlations in the original data space. Hand et al.
(2001) emphasize that the numerical size of clusters should not be accorded too
great importance, since it is precisely the unexpected, that which goes against the
rules, that is being sought.
Generally speaking, however, the validity of a clustering is often in the eye of the beholder;
for example, if a cluster produces an interesting scientific insight, we can judge it to be
useful. (Hand et al. 2001, p. 292)
In cases where new structures are detected, other unsupervised approaches
should be adopted to validate the clustering results. One such approach
is to cluster the data using a different clustering algorithm. Another is to calculate
some cluster-immanent measure. Finally, the approach which best meets the aims
of knowledge discovery is to seek a semantic interpretation of the detected clusters.
This means determining whether a cluster makes sense through the application of
knowledge generation methods (see next section).
Figure 3.13 shows a hierarchical clustering of the data using Ward clustering
(Ward 1963) to produce a dendrogram (Carlsson and Mémoli 2010). In a hierarchical
algorithm, the user has to specify either a threshold distance or the number of
clusters in order to obtain a partition. In our case, a threshold distance of 100 was
used, giving eight Ward clusters (WC).
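How a threshold distance turns a Ward hierarchy into a partition can be sketched as follows. This is a naive, cubic-time illustration rather than the procedure used for Figure 3.13. It assumes the common convention in which the Ward merge distance between clusters A and B is sqrt(2|A||B|/(|A|+|B|)) times the Euclidean distance between their centroids, so that it reduces to the ordinary Euclidean distance for two single points. The function names, the toy points, and the threshold of 5.0 are hypothetical. In practice a library routine (for example SciPy's linkage with method="ward") would normally be used instead.

```python
from itertools import combinations
from math import sqrt

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def ward_distance(a, b):
    """Ward merge distance between two clusters given as lists of points."""
    ca, cb = centroid(a), centroid(b)
    squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return sqrt(2.0 * len(a) * len(b) / (len(a) + len(b)) * squared)

def ward_clusters(points, threshold):
    """Merge clusters greedily until the cheapest merge exceeds the threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest Ward merge distance.
        (i, j), d = min(
            (((i, j), ward_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break  # equivalent to cutting the dendrogram at the threshold
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_clusters = ward_clusters(toy_points, threshold=5.0)
```

Because Ward merge distances never decrease along the hierarchy, stopping at the first merge that exceeds the threshold gives the same partition as cutting the full dendrogram at that height. Specifying the number of clusters instead would simply change the stopping condition to a check on len(clusters).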
The results of different clustering algorithms can be compared using contingency
tables (Fienberg 2007). In our case, the two methods have produced rather similar
clustering partitions (cf. Table 3.2). One of the outlier clusters, namely UC9
in the U-matrix clustering, has been subsumed into Ward cluster WC6. In this case,
the Ward clustering basically confirms the U-matrix clustering and vice versa.
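A contingency table of two partitions simply counts, for every pair of cluster labels, how many data points carry both labels. The following minimal sketch shows one way to build such a table. The function name contingency_table and the toy labels are hypothetical and do not reproduce the counts of Table 3.2.

```python
from collections import Counter

def contingency_table(labels_a, labels_b):
    """Cross-tabulate two labelings of the same data points.

    Returns (row_labels, column_labels, table), where table[i][j] is the
    number of points with row_labels[i] in labels_a and column_labels[j]
    in labels_b.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same data points")
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical labels for five data points under two clusterings.
toy_uc = ["UC1", "UC1", "UC2", "UC2", "UC9"]
toy_wc = ["WC1", "WC1", "WC2", "WC2", "WC2"]
toy_rows, toy_cols, toy_table = contingency_table(toy_uc, toy_wc)
```

When two partitions agree, each row has its counts concentrated in a single column. A small cluster that is absorbed by a larger one in the other partition, as in the toy row for UC9, shows up as a second row sharing that column.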
The silhouettes proposed by Rousseeuw (1987) are a useful graphical display
for the interpretation and validation of data partitioning. The values in a silhouette
range from -1 to +1 for each data point. Large positive values indicate that a data