$$\alpha = -\frac{\ln 0.5}{\bar{D}}, \qquad (10.29)$$
where $\bar{D}$ is the mean distance among the pairs of data points in a hyperspace.
Hence, $\alpha$ is determined by the data and can be calculated automatically.
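As an illustration, the calculation in (10.29) can be sketched in Python. The sketch assumes Euclidean distance between data points and an exponential similarity of the form exp(-alpha * distance), under which two points separated by the mean distance have a similarity of exactly 0.5. The function names `mean_pairwise_distance` and `decay_constant` are hypothetical and are not taken from the source.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of data points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def decay_constant(points):
    """alpha = -ln(0.5) / D_bar, so that exp(-alpha * D_bar) equals 0.5."""
    return -math.log(0.5) / mean_pairwise_distance(points)


# Three collinear points with pairwise distances 5, 10 and 5.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_bar = mean_pairwise_distance(points)  # (5 + 10 + 5) / 3
alpha = decay_constant(points)
```

Because both quantities depend only on the data, no value has to be supplied by the user at this stage.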
10.5.2.2 Fuzzy Clustering Based on Entropy Measure
In order to determine the first cluster centre, the entropy at each data point is
evaluated. The data point that has the lowest entropy value is selected as a potential
cluster centre. Thereafter, this first cluster centre and all the data points that have
similarity with it greater than a threshold value of E are removed, so that they are
ignored as possible subsequent cluster centres in the next iterations. The procedure
is continued with the search for the next cluster, which is selected as the point with
the minimal entropy value among the remaining data points and, again, this cluster
centre and the associated data points having similarity greater than E are similarly
removed. This process is repeated until no data points are left.
The parameter E can be viewed as a threshold of similarity value or as
association value among the data points in the same clusters. It assumes a value
within the range (0.0, 1.0), whereby the value of E = 0.7 is quite robust, as shown
experimentally in Yao et al. (2000). In the algorithm described below, T is the
input data with N data points, each of which has M dimensions.
Algorithm 10.1. Entropy-based fuzzy clustering: EFC(T)
• Step 1: calculate the entropy for each z_i in T, for i = 1, 2, …, N.
• Step 2: choose the data point z_iMin that has the lowest entropy as a cluster centre.
• Step 3: remove z_iMin and all the data points that have similarity greater
than E with the cluster centre z_iMin from the data set T.
• Step 4: repeat steps 2 and 3 until T is empty.
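The steps of Algorithm 10.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes Euclidean distance, a similarity S = exp(-alpha * distance), and an entropy of the form -sum[S log2 S + (1 - S) log2(1 - S)] taken over all other data points, none of which is restated in this section. The names `efc`, `threshold_e` and the value alpha = 0.2 used in the example are hypothetical.

```python
import math


def similarity_matrix(points, alpha):
    """Assumed similarity: S_ij = exp(-alpha * Euclidean distance)."""
    return [[math.exp(-alpha * math.dist(p, q)) for q in points] for p in points]


def entropy(sim_row, i):
    """Assumed entropy of point i with respect to all other points."""
    total = 0.0
    for j, s in enumerate(sim_row):
        # Skip the point itself and the limiting cases S = 0 and S = 1,
        # whose contribution to the entropy is zero.
        if j != i and 0.0 < s < 1.0:
            total -= s * math.log2(s) + (1.0 - s) * math.log2(1.0 - s)
    return total


def efc(points, alpha, threshold_e=0.7):
    """Return the indices of the selected cluster centres."""
    sim = similarity_matrix(points, alpha)
    # Step 1: entropy of every data point, computed once.
    ent = [entropy(sim[i], i) for i in range(len(points))]
    remaining = set(range(len(points)))
    centres = []
    # Step 4: repeat steps 2 and 3 until no data points are left.
    while remaining:
        # Step 2: the remaining point with the lowest entropy (ties by index).
        c = min(remaining, key=lambda i: (ent[i], i))
        centres.append(c)
        # Step 3: remove the centre and every point whose similarity
        # with the centre exceeds the threshold E.
        remaining -= {j for j in remaining if j == c or sim[c][j] > threshold_e}
    return centres


# Two tight groups of three points each; alpha = 0.2 is an illustrative value.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centres = efc(points, alpha=0.2)
```

For this data the sketch selects one centre from each group, since similarities within a group are close to 1 while similarities between the groups are well below 0.7.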
If the data set has outliers that are very distant from the rest of the data, then the
EFC algorithm described above may select these data points as cluster centres,
because their entropy values will also be very low. To overcome
this problem, Yao et al. (2000) introduce a new parameter J that acts as a
threshold between potential clusters and outliers. Before a data point is selected
as a cluster centre, the number of data points that have similarity greater
than E with it is counted. If this count is less than the value of J,
then the data point is unfit to be a cluster centre and is rejected from the
data set, so that it is not considered in subsequent iterations. In the work of
Yao et al. (2000), J = 0.05N is selected as the threshold for outlier detection. The
selection of J and the corresponding removal of outliers also help to prevent
overfitting of the data.
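The outlier test can be sketched as an extension of the clustering loop. As before, this is only an illustration under assumptions: Euclidean distance, similarity exp(-alpha * distance), and the entropy form -sum[S log2 S + (1 - S) log2(1 - S)] are assumed. In addition, the sketch counts neighbours among the data points still remaining and excludes the candidate itself, a detail the text leaves open. The names `efc_with_outlier_rejection` and `outlier_fraction` are hypothetical.

```python
import math


def efc_with_outlier_rejection(points, alpha, threshold_e=0.7,
                               outlier_fraction=0.05):
    """Return (cluster centre indices, rejected outlier indices)."""
    n = len(points)
    sim = [[math.exp(-alpha * math.dist(p, q)) for q in points] for p in points]

    def entropy(i):
        total = 0.0
        for j in range(n):
            s = sim[i][j]
            if j != i and 0.0 < s < 1.0:
                total -= s * math.log2(s) + (1.0 - s) * math.log2(1.0 - s)
        return total

    ent = [entropy(i) for i in range(n)]
    min_count = outlier_fraction * n  # threshold J = 0.05 N by default
    remaining = set(range(n))
    centres, outliers = [], []
    while remaining:
        # Candidate centre: remaining point with the lowest entropy.
        c = min(remaining, key=lambda i: (ent[i], i))
        # Assumption: count only remaining points, excluding the candidate.
        neighbours = {j for j in remaining
                      if j != c and sim[c][j] > threshold_e}
        if len(neighbours) < min_count:
            # Too few similar points: reject the candidate as an outlier.
            outliers.append(c)
            remaining.discard(c)
            continue
        centres.append(c)
        remaining -= neighbours | {c}
    return centres, outliers


# Twenty closely spaced points plus one distant point (index 20).
points = [(0.01 * k, 0.0) for k in range(20)] + [(100.0, 100.0)]
centres, outliers = efc_with_outlier_rejection(points, alpha=0.2)
```

In this example the distant point has almost zero similarity with every other point and therefore a very low entropy, so it is the first candidate. It has no neighbours above the threshold E, which is fewer than J = 0.05 × 21 = 1.05, so it is rejected and a single centre is then chosen from the twenty remaining points.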