Exploring Validity Indices for Clustering Textual Data - Mining Complex Data


each VI, assess that with the context-aware method we can still have a comparable and sometimes better clustering quality than the standard method without involving any context-awareness. On average, using the method improved the FScore at the optimal value of VI by 1.27% and 0.16% in DS2 and DS3, and degraded it by 0.21% and 0.14% in DS1 and DS4, respectively. This can be explained by the “safe” decisions taken at each step of the process. Although they do not improve VI at the highest speed, these context-aware decisions, by foreseeing the upcoming merges, provide better clustering possibilities in future iterations, and thus competitive partition quality at the end.

16.5.3 On the Quality of the Final Solutions

More informative than the quality of the optimal solutions is the quality of the final solutions obtained when stopping the process before FD. Thus, these solutions, obtained with and without context-awareness, are also evaluated in terms of FScore. On average, using the context-aware method improved the FScore by 63.14%, 30.16%, 10.04%, and 19.53% in DS1, DS2, DS3, and DS4, respectively. The largest improvements occur in the document clustering datasets. This is not surprising given the relatively poor representation of word patterns compared to documents.
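As a reminder of what an FScore comparison of this kind measures, here is a minimal sketch. It assumes the commonly used class-weighted definition, in which each reference class is matched to the cluster that gives it the best F-measure and the results are weighted by class size; the chapter's exact definition may differ. The function name `clustering_fscore` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_fscore(true_classes, cluster_labels):
    """Class-weighted FScore of a flat clustering (assumed definition).

    For every reference class, take the best F-measure over all clusters,
    then average these values weighted by class size. The maximum is 1.0.
    """
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_labels)
    # overlap[(c, k)] = number of items of class c placed in cluster k
    overlap = Counter(zip(true_classes, cluster_labels))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy example: two classes of three items, one item of class "a" misplaced.
truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
score = clustering_fscore(truth, found)
```

Running the same function on the partitions produced with and without context-awareness gives the two values whose difference is reported as an improvement or a deterioration.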

16.6 Conclusion and Future Trends

On the one hand, we presented an experimental study showing that indices generally perform “well” at evaluating solutions, especially when dealing with words. However, although they are supposed to be completely insensitive to the number of clusters k, they showed some rigidity with respect to k, leading to erroneous top-ranked solutions. In addition, we saw that these indices, when used as criterion functions, yield slightly better results than when they are simply used as external indicators.

On the other hand, we studied the feasibility of using relative indices as stopping criteria in agglomerative clustering algorithms. Experiments performed in two applications, document and word clustering, showed that indices used alone are not effective for this purpose. Thus, we presented a method that aims to smooth the indices' plots by taking the “safest” decision at each level of the clustering process. We demonstrated that the method can remarkably enhance the use of relative indices as stopping criteria.
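To make the idea of a relative index used alone as a stopping criterion concrete, here is a minimal sketch of the naive rule only. It does not reproduce the context-aware smoothing method of this chapter. The choice of average linkage, the mean silhouette width as the relative index, the rule "stop just before the first merge that would lower the index", and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def average_linkage(points, c1, c2):
    """Mean pairwise distance between two clusters (lists of point indices)."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def mean_silhouette(points, clusters):
    """Mean silhouette width; singletons and a single cluster score 0."""
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for idx, cluster in enumerate(clusters):
        if len(cluster) == 1:
            continue  # silhouette of a singleton is 0 by convention
        for i in cluster:
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(points[i], points[j])
                    for j in cluster if j != i) / (len(cluster) - 1)
            # b: smallest mean distance to any other cluster
            b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                    for k, other in enumerate(clusters) if k != idx)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(points)


def agglomerate_until_drop(points, index=mean_silhouette):
    """Merge the closest clusters until the next merge would lower the index."""
    clusters = [[i] for i in range(len(points))]
    current = index(points, clusters)
    history = [current]
    while len(clusters) > 1:
        # The closest pair under average linkage is the merge candidate.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: average_linkage(points, clusters[pq[0]], clusters[pq[1]]))
        merged = [c for k, c in enumerate(clusters) if k not in (p, q)]
        merged.append(clusters[p] + clusters[q])
        candidate = index(points, merged)
        if candidate < current:
            break  # naive rule: stop just before the index would drop
        clusters, current = merged, candidate
        history.append(current)
    return clusters, history


# Two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_clusters, index_history = agglomerate_until_drop(data)
```

On clean toy data like this the index rises steadily and the rule stops at the two natural groups. On real document or word data the index curve is typically irregular, so the same rule can stop at an early local dip, which is the weakness that motivates smoothing the plots.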

An important drawback of most relative indices is their high computational cost. Yet their use seems crucial for parameter-free clustering. Thus, an important trend is to develop methods that can accurately approximate their values on reduced, representative subsets of the data. Among the few works conducted in this direction, we can cite [11]. Such works, if combined with efficient clustering methods (e.g., CLIQUE, PROCLUS, bisecting k-means, frequent pattern-based methods), may enable objective clustering of large, high-dimensional datasets.
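The simplest form of such an approximation can be sketched as follows. This is only an illustration using uniform random sampling and the mean silhouette width as the relative index; it is not the method of [11], and the function names, the sample size, and the toy data are assumptions.

```python
import math
import random


def mean_silhouette(points, labels):
    """Exact mean silhouette width; cost grows quadratically with len(points)."""
    by_cluster = {}
    for i, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(i)
    if len(by_cluster) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = by_cluster[lab]
        if len(own) == 1:
            continue  # silhouette of a singleton is 0 by convention
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(points[i], points[j]) for j in members) / len(members)
                for other, members in by_cluster.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def sampled_silhouette(points, labels, sample_size, seed=0):
    """Estimate the index on a uniform random subset of the data."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(points)), min(sample_size, len(points)))
    return mean_silhouette([points[i] for i in chosen],
                           [labels[i] for i in chosen])


# Two compact 10 x 10 grids of points, far apart from each other.
grid = [(0.1 * (i % 10), 0.1 * (i // 10)) for i in range(100)]
points = grid + [(x + 50.0, y + 50.0) for x, y in grid]
labels = [0] * 100 + [1] * 100

exact = mean_silhouette(points, labels)
estimate = sampled_silhouette(points, labels, sample_size=40)
```

A uniform sample is only representative when clusters are reasonably balanced; with very small clusters a stratified sample, or a dedicated approximation scheme, would be needed.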
