On indices' rigidity over the number of clusters. In spite of their ability to reach "good" clustering solutions, relative indices provide relatively poor top-ranked solutions. In other words, while they can reach high FScores at a specific k, they have a hard time detecting this optimal k, leading unfortunately to poor solutions compared to what they could have reached. This can be explained by their somewhat rigid trends over the number of clusters. Even though these indices are supposed to be completely insensitive to k, depending only on the dataset, this is not always the case, which leads them to indicate optimal solutions in the "wrong places". The gaps between the FScores are dramatic; for instance, while C3 could lead the algorithm to an FScore of 0.60 on DS4, it reaches an FScore of only 0.21 at its own optimal value. The only exceptions are the H3 and C4 indices, which provide comparable FScores owing to their high ability to detect the predefined optimal k.
The rigid trend over k can be seen more clearly in Figures 16.7 and 16.8. Focusing on the first column, "K @ Optimal index", we notice that most indices keep similar relative trends over the optimal k across the different datasets. On our datasets, they generally tend towards a large optimal k compared to the "real" optimal k, which is remarkably smaller ("K @ Optimal F"). These differences between the predicted optimal k and the "real" optimal k naturally lie behind the wide FScore gaps mentioned above. This also explains why H3 and C4 are exceptions among the indices: their high ability to reach the optimal k directly supports their ability to reach the optimal clustering solutions.
Are indices better used as external indicators or criterion functions? As mentioned earlier, involving indices as criterion functions leads to much higher complexity than involving them as external indicators. It is natural to believe that driving an algorithm by optimizing a validity index would outperform an approach that drives the algorithm by a "blind" similarity between patterns (e.g., mean-linkage), paying no attention to the overall clustering quality. Surprisingly, however, our results showed that the difference is not significant. For instance, consider the one-to-one maximal rates reached with each approach: the H3/mean-linkage criteria led to maximal FScores of 0.695/0.671, 0.624/0.593, 0.245/0.240, and 0.204/0.192 on DS1, DS2, DS3, and DS4, respectively. Slight improvements were thus noticed when involving the H3 index, but the results are comparable, especially with the other indices. Depending on the requirements of each task, the open question remains: is it worthwhile to considerably increase the algorithm's complexity for only a slight improvement in partition quality? In most cases, the answer will be no. However, for those who definitely seek the optimal partitions out of an algorithm, reducing this complexity is highly desirable. This is broadly the purpose of the method we proposed in Section 16.3, which reduces the algorithm's complexity by using indices as stopping criteria. This method is evaluated in the next section.
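For illustration, the cheaper external-indicator usage can be sketched as follows: run an ordinary clustering algorithm once per candidate k, then let a relative validity index choose k afterwards, instead of re-optimizing the index inside the algorithm at every step. The chapter's H3/C3/C4 indices are not spelled out here, so this sketch substitutes a Calinski-Harabasz-style variance ratio and a toy deterministic 1-D k-means; all names and data are illustrative:

```python
import statistics

def kmeans_1d(points, k, iters=20):
    # Toy deterministic 1-D k-means with quantile initialisation (k >= 2).
    pts = sorted(points)
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def relative_index(clusters, n):
    # Calinski-Harabasz-style ratio of between- to within-cluster dispersion,
    # used as a stand-in relative validity index (higher is better).
    grand = statistics.mean([p for c in clusters for p in c])
    k = len(clusters)
    between = sum(len(c) * (statistics.mean(c) - grand) ** 2 for c in clusters)
    within = sum(sum((p - statistics.mean(c)) ** 2 for p in c) for c in clusters)
    return (between / (k - 1)) / (within / (n - k))

# Three tight 1-D groups: the index, applied externally, should peak at k = 3.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
scores = {k: relative_index(kmeans_1d(points, k), len(points))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

The index is evaluated only once per candidate k, after clustering, which is the source of the complexity saving discussed above; using it as a criterion function would instead require evaluating it at every merge or assignment step.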
 