16.3.2 Context-Aware Method
Context-Aware Clustering. The notion of context was introduced to cluster analysis in [17, 22], where it typically served to make the similarity calculation between objects more reliable by taking their relative nearest neighbor objects into account. Our goal in involving context in clustering is entirely different: we aim to use context to strengthen the ability of validity indices to act as stopping criteria, so that a first drop (FD) in the quality of a clustering solution more reliably indicates that the optimal solution has been reached. Reconsidering Figure 16.1, the goal is to enhance the classic index curve with context-awareness, so as to obtain a more temperate curve (the index + context curve) whose FD (at d) approaches the optimal solution (at c) as closely as possible.
To achieve this goal, the idea is to give clustering algorithms a wider view of the dataset partition, enabling them to take decisions while keeping in "mind" an "idea" of what could happen next if a specific action is undertaken. As we target the hierarchical agglomerative algorithm, the method applies the following heuristic at each level j of the process:

1. consider the M closest pairs of clusters;
2. estimate VI after tentatively merging each of the M pairs;
3. among the mergings that improve VI (over the previous VI at level j − 1), select the merging with the lowest Context Risk (CR);
4. if no merging improves VI, merge the pair that optimizes (maximizes/minimizes) VI.
Consider the bi-dimensional dataset provided in Figure 16.2, where each point represents a cluster centroid³. Suppose that at this stage we have two merging options: (1) merging C_i and C_j into a new cluster C_p, and (2) merging C_m and C_n into a new cluster C_q. Suppose that C_p optimizes a validity index VI, while C_q merely improves it. Before taking any definitive decision, the method examines the context of C_p and C_q in terms of their K nearest neighbors, K-NN (i.e., the surrounding clusters)⁴. If the context of C_p indicates that merging C_i and C_j could lead to a global quality degradation in terms of VI in the next iterations, the method chooses instead to create C_q, improving VI at a minimal context risk.
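The text does not reproduce the formal definition of the Context Risk here, but the context itself is simply the K nearest neighboring centroids of a candidate cluster. A minimal sketch of that retrieval step, with hypothetical names:

```python
import math

def knn_context(candidate, centroids, k):
    """The context of a candidate cluster: its k nearest neighboring
    cluster centroids (the 'surrounding clusters'). `candidate` is the
    centroid of the tentatively merged cluster; `centroids` are the
    centroids of the other clusters in the current partition."""
    return sorted(centroids, key=lambda s: math.dist(candidate, s))[:k]
```

The risk measure CR would then be computed over this neighbor set, e.g. by estimating how the candidate merge affects VI with respect to the surrounding clusters.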
Applying this method certainly implies a more temperate and slower improvement of VI (as shown by the index + context curve in Figure 16.1), but it has the advantage of continuously postponing, as far as possible, risky merging actions that entail possible future degradations. We argue that taking the "safest" action at each level causes an expected degradation to occur as late as possible during the process. Thus, a first drop (FD) is likely to occur closer to the optimal solution, which allows the algorithm to more reliably consider the FD at (d + 1) as a stopping criterion, and the solution provided at d as the optimal clustering solution.
³ S_i is the centroid of cluster C_i.
⁴ We set K = 10 in our experiments.