…methods for the grouping are clustering techniques that extract the center information of subsets of data within a large data set. In this study, the hierarchical clustering algorithm is applied, and its solution is then compared with other benchmark clustering algorithms, such as the fuzzy C-means and subtractive clustering algorithms.
4.1. Premise Part
The hierarchical clustering algorithm determines the membership of data points in their clusters by constructing a hierarchical organization of a given set of data, revealing the membership of each datum differently at distinct hierarchy levels (Ward 1963; Clauset et al. 2008). Once the hierarchy is established, one of the levels can be chosen either to yield a desired number of clusters or to optimize an objective function, depending on the problem at hand. To describe the algorithm, the dissimilarity, or distance, between a pair of data points $u_p^{(I)}$ and $u_p^{(J)}$ is first considered as
$$ d_{IJ} = \left\| u_p^{(I)} - u_p^{(J)} \right\|, \qquad (7) $$
where any well-behaved metric can be taken as the distance. In addition, the distance between a pair of sets can be defined as a function of the distances of all the pairs of data extracted from the respective sets, i.e.,
$$ D_{UV} = F\left\{\, d_{IJ} : I \in U,\ J \in V \,\right\}, \qquad (8) $$
where $F$ can be the minimum, the maximum, the average, or any other function of the elements $d_{IJ}$. In this study, we choose the Euclidean distance for $d_{IJ}$ and the minimum for $F$.
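For the one-dimensional data considered here, Eq. (8) with the Euclidean $d_{IJ}$ and the minimum for $F$ reduces to a scan over all cross-set pairs. A minimal Python sketch (the function name `set_distance` and the sample values are ours, for illustration):

```python
# Single-linkage set distance for scalar data:
# D_UV = min{ |u_i - v_j| : u_i in U, v_j in V }   (Eq. (8) with F = min)
def set_distance(U, V):
    return min(abs(u - v) for u in U for v in V)

# example: the closest cross-set pair is (0.2, 1.0)
print(set_distance([0.1, 0.2], [1.0, 1.1]))  # 0.8
```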
The algorithm starts by allocating each data point to its own set, so that each set contains only one data point. Then, the distances of all the pairs of sets are evaluated to find the minimum-distance pair, and the two sets thus found are merged into one. By repeating this, the number of sets decreases by one at each iteration, so the desired number $N_q$ of sets is obtained after $(N - N_q)$ iterations. The center of each cluster can be calculated from the locations of the data points composing it. Note that while the procedure described above is called agglomerative hierarchical clustering, a divisive strategy can also be applied to obtain the same result. Note also that our choice of distance metrics produces a result equivalent to splitting the set at the $(N_q - 1)$ locations where the distances between adjacent data points are largest, since we are concerned with the clustering of one-dimensional data in this study.
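As a sketch of the whole procedure, and of the gap-splitting equivalence just noted, the following Python fragment clusters one-dimensional data by cutting the sorted data at the $(N_q - 1)$ largest gaps between adjacent points. The helper name `cluster_1d` and the sample data are our illustrative assumptions, with NumPy assumed available:

```python
import numpy as np

def cluster_1d(u, n_clusters):
    """Single-linkage agglomerative clustering of 1-D data.

    With Euclidean d_IJ and the minimum for F, merging the
    closest pair of sets (N - N_q) times is equivalent to cutting
    the sorted data at the (N_q - 1) largest adjacent gaps.
    """
    u = np.sort(np.asarray(u, dtype=float))
    if n_clusters < 2:
        return [u], [u.mean()]
    gaps = np.diff(u)                                   # distances between adjacent points
    cuts = np.sort(np.argsort(gaps)[-(n_clusters - 1):]) + 1
    clusters = np.split(u, cuts)                        # one segment per cluster
    centers = [c.mean() for c in clusters]              # centers from member locations
    return clusters, centers

# example: three well-separated groups on the real line
_, centers = cluster_1d([0.1, 0.2, 0.15, 1.0, 1.1, 2.3, 2.4, 2.35], 3)
print(centers)  # approximately [0.15, 1.05, 2.35]
```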
4.2. Consequent Part
Once the premise part is optimized, the consequent-part parameters can be optimized with the weighted linear least squares algorithm. Based on Gauss's celebrated principle of least squares (Gauss 1963), the linear least squares algorithm can be formulated as a quadratic optimization problem that minimizes the error between the true values and the estimated model outputs:

$$ \min J = \frac{1}{2}\, e^{T}(k)\, e(k), \qquad (9) $$

where $e(k) = y(k) - \hat{y}(k)$; i.e., the error $e(k)$ is the difference between the estimation model output $\hat{y}(k)$ and the true value $y(k)$. Note that the normal linear least squares formulation can easily be extended into the weighted linear least squares by introducing a weight factor; thus, in what follows, the normal linear least squares estimator is derived first, and the weight factor is then added. A linear estimation model is used with the linear least squares algorithm.
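As a minimal sketch of this step, assuming a linear model of the form $\hat{y} = X\theta$ (our notation, not the authors'): minimizing the weighted form of Eq. (9) in closed form gives $\theta = (X^{T}WX)^{-1}X^{T}Wy$, and setting all weights to one recovers the normal least squares estimator. The function name and toy data below are illustrative assumptions:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize (1/2) * e^T W e with e = y - X @ theta, W = diag(w).

    Closed form: theta = (X^T W X)^{-1} X^T W y.
    With w all ones this reduces to the normal least squares estimator.
    """
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# example: fit y ~ a*x + b, down-weighting the last sample
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])
X = np.column_stack([x, np.ones_like(x)])  # regressors [x, 1]
w = np.array([1.0, 1.0, 1.0, 0.5])         # e.g., membership degrees as weights
print(weighted_least_squares(X, y, w))     # slope ~ 2, intercept ~ 1
```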