A Projected Clustering Algorithm and Its Biomedical Application - Clustering Challenges in Biological Network


the two algorithms have comparable accuracy. This can be explained by the use

of modified Manhattan segmental distance in IPROCLUS. Modified Manhattan

segmental distance can effectively deal with scaled data while for unscaled data,

it is quite close to Manhattan segmental distance which is used in PROCLUS.
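As a rough illustration of the two distances discussed above, the sketch below implements the Manhattan segmental distance in the form used by PROCLUS (the average absolute difference over a cluster's dimension set). The modified variant is not defined in this excerpt, so `scaled_segmental_distance` is only an assumption: it divides each per-dimension difference by a hypothetical per-dimension range to show how the effect of scale could be removed, and it is not necessarily the formula IPROCLUS uses.

```python
from typing import Sequence


def manhattan_segmental_distance(
    x: Sequence[float], y: Sequence[float], dims: Sequence[int]
) -> float:
    """Average absolute difference between x and y over the dimensions in dims."""
    if not dims:
        raise ValueError("dims must contain at least one dimension index")
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)


def scaled_segmental_distance(
    x: Sequence[float],
    y: Sequence[float],
    dims: Sequence[int],
    ranges: Sequence[float],
) -> float:
    """Hypothetical scale-aware variant (an assumption, not the IPROCLUS definition).

    Each per-dimension difference is divided by ranges[d], an assumed
    positive spread of dimension d, before averaging.
    """
    if not dims:
        raise ValueError("dims must contain at least one dimension index")
    return sum(abs(x[d] - y[d]) / ranges[d] for d in dims) / len(dims)


# Illustrative points: dimension 2 has a much larger scale than dimension 0.
x = [0.0, 10.0, 200.0]
y = [3.0, 10.0, 100.0]
plain = manhattan_segmental_distance(x, y, dims=[0, 2])
scaled = scaled_segmental_distance(x, y, dims=[0, 2], ranges=[10.0, 1.0, 1000.0])
```

In this made-up example the plain distance is dominated by the large-scale dimension, while the scaled variant weights both dimensions comparably. With all ranges equal to 1 the two functions coincide, which mirrors the remark that the two distances are close on unscaled data.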

When comparing the three cases together, it can be seen that the no-common-

dimensions case has the lowest accuracy rate and the all-same-dimensions case

has the highest accuracy rate. This can be explained by the difference in the average

number of dimensions in a cluster. When the average number of dimensions is low

(in no-common-dimensions, l = 4), there is a higher probability that data points are

assigned to the wrong cluster, since points are correlated on only 4 dimensions.

However, in the all-same-dimensions case, where l = 10, it is easier to correctly

cluster points since the correlation between data points is stronger.

Second, we present the result of testing the dependence on l. The dependence

is evaluated by the least-squares error of the number of dimensions. The same

datasets as in the accuracy test are used for the three cases.
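The excerpt does not give the exact error formula, so the following is a minimal sketch under an assumption: the error is taken as the sum of squared differences between the number of dimensions reported for each cluster and the true number for that cluster. The function name `dimension_count_sse` and the sample counts are hypothetical, and the sketch assumes the found clusters have already been matched to the true clusters.

```python
from typing import Sequence


def dimension_count_sse(
    found_counts: Sequence[int], true_counts: Sequence[int]
) -> int:
    """Sum of squared differences between found and true dimension counts.

    found_counts[i] and true_counts[i] must refer to the same (matched) cluster.
    """
    if len(found_counts) != len(true_counts):
        raise ValueError("both sequences must describe the same clusters")
    return sum((f - t) ** 2 for f, t in zip(found_counts, true_counts))


# Illustrative values only: three clusters, one of them off by one dimension.
error = dimension_count_sse([4, 5, 7], [4, 6, 7])
```

A lower value across a range of input l values would indicate that the reported dimension counts depend less on l.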

Figure 9.4 shows the results we get for the unscaled dataset in the random case.

We get similar results for the extreme cases. We can see that IPROCLUS has less

dependence on l than PROCLUS in terms of the number of dimensions. For the

scaled data in the three cases, we observe a similar trend; we do not give the

figures here since PROCLUS has a much higher error rate for the scaled datasets.

Fig. 9.4. Dependence of the Number of Dimensions on l

In summary, IPROCLUS greatly reduces the dependence on parameter l, since

the dimension tuning process checks for additional dimensions for each cluster in

order to add any dimension that enables better clustering, while PROCLUS

decides the average number of dimensions solely based on l.

For the running time test, we apply IPROCLUS and PROCLUS to datasets with different

numbers of points. The datasets are generated in the same way as the datasets

used in the previous two tests. The result is that the execution times of

these two algorithms are comparable in all three cases.

Since the
