3.4.2 Partition Clustering (K-Means Clustering)
Partition methods are among the most popular clustering techniques; they break the objects or observations into distinct, non-overlapping groups according to some fixed criteria. Many partition methods are available in the literature, based on criteria such as K-means, K-medians, etc. Some common non-hierarchical methods, apart from the K-means method, are nucleated agglomerative clustering and clustering using mixture distributions. K-means clustering is the best-known non-hierarchical clustering technique for data grouping and is also known as Forgy's algorithm. The K-means algorithm was developed by MacQueen [50], who was inspired by the work of Steinhaus [70]; it was later enhanced by J.A. Hartigan and M.A. Wong during 1975–1977. K-means clustering is particularly appropriate if the number of clusters can be specified a priori, before the clustering process begins. Unlike hierarchical CA, which looks for n possible clusterings of the observations, the researcher must provide the number of required clusters, after which the K-means CA program searches for the best solution with that number of clusters. This could be viewed as a disadvantage unless prior knowledge of the correct number of groups exists. For this reason, K-means clustering is often repeated for a range of initial group numbers and group assignments, as sketched below.
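For example, a minimal sketch of this practice in Python, assuming scikit-learn is available; the synthetic data and the candidate range of 2-7 groups are illustrative assumptions, not taken from the text:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # placeholder data; substitute real observations

# Repeat K-means for a range of candidate group numbers and compare the
# within-cluster sum of squares (inertia) of each solution
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: within-cluster sum of squares = {km.inertia_:.1f}")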
The algorithm then proceeds in the following way:

1. The centroid x̄_k of each cluster is calculated.
2. The distance between the data point x_i and all the centroids is calculated.
3. If x_i is already a member of the group whose mean is closest, repeat step 2 for x_{i+1}; otherwise, reassign x_i to the group whose mean is closest and return to step 1.

This is repeated until every x_i is closest to its own group mean. The K-means clustering algorithm is described simply in Fig. 3.5.
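A minimal from-scratch sketch of these three steps, assuming Euclidean distance and randomly chosen initial centroids (the text does not prescribe an initialization):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct observations chosen at random (an
    # assumed initialization; the text does not prescribe one)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distances between every data point x_i and all centroids
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each x_i to the group whose mean is closest
        labels = dists.argmin(axis=1)
        # Step 1: recompute the centroid (mean) of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # every x_i is already closest to its own group mean
        centroids = new_centroids
    return labels, centroids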
In the case of nucleated agglomerative clustering, the procedures of K-means clustering and Ward's method are combined to produce a better grouping. In this method, two numbers of groups must be specified: an initial number of groups and a final number of groups. The initial number of groups is specified to be larger than the final number of groups [9]. The procedure starts with K-means CA, after which the two closest groups are merged according to Ward's method (described in Sect. 3.4.1.5).
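A rough sketch of this two-stage procedure, again assuming scikit-learn; applying Ward-linkage agglomeration to the K-means centroids is an illustrative shortcut (it ignores the sizes of the initial groups) rather than the reference implementation of [9]:

from sklearn.cluster import KMeans, AgglomerativeClustering

def nucleated_clustering(X, initial_k, final_k, seed=0):
    # Stage 1: K-means with the (larger) initial number of groups
    km = KMeans(n_clusters=initial_k, n_init=10, random_state=seed).fit(X)
    # Stage 2: merge the closest groups with Ward's method until only
    # final_k groups remain (here applied to the K-means centroids)
    ward = AgglomerativeClustering(n_clusters=final_k, linkage="ward")
    merged = ward.fit_predict(km.cluster_centers_)
    # Map each observation to the merged group of its initial cluster
    return merged[km.labels_]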
Clustering using mixture distributions is a method in which multivariate normal distributions are fitted to the data via the Expectation–Maximization (EM) algorithm [23, 29, 53, 75]. Each group is represented by a probability density function (PDF). The probability of membership of a data point in a group, given its PDF, is obtained using the EM algorithm. The data points are grouped by assigning each x_i to the PDF f_g(x_i) that has the largest probability.
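A minimal sketch of this mixture-based grouping, assuming scikit-learn's GaussianMixture (which runs EM internally); the two-component synthetic data are an illustrative assumption:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # illustrative two-group data
               rng.normal(4, 1, size=(100, 2))])

# EM fits one multivariate normal PDF per group
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)   # membership probability of each x_i in each group
labels = gm.predict(X)        # assign x_i to the PDF f_g(x_i) with largest probability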
As with many clustering approaches, the K-means clustering method also has weaknesses. They are as follows: