An Immune Network for Contextual Text Data Clustering - Artificial Immune Systems


At the extreme, the dynamics of the distribution of the longest edges is similar for the contextual and the global models, but distinct for the hierarchical model. The latter contains many more very long edges. Recalling that, for this model, the variance of the edge lengths was low and the average length was high, we can conclude that the hierarchical model is generally more discontinuous. The same is true for the SOM model, which is another indication of the imperfection of the static grid topology.
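The comparison above rests on three statistics of the edge lengths of a model graph: the mean, the variance, and the behaviour of the longest edges. The following is only a minimal sketch of how such a summary could be computed. It assumes Euclidean distance between node positions, whereas the paper may use a different dissimilarity measure, and the function name `edge_length_summary` and the parameter `tail_fraction` are hypothetical names introduced here for illustration.

```python
import math
import statistics


def edge_length_summary(positions, edges, tail_fraction=0.1):
    """Summarise the edge lengths of a graph embedded in a vector space.

    positions: dict mapping node id -> coordinate tuple
    edges: iterable of (u, v) node-id pairs
    tail_fraction: share of the longest edges treated as the "extreme" tail
                   (a hypothetical parameter, not taken from the paper)
    """
    # Assumption: Euclidean distance stands in for the model's own measure.
    lengths = sorted(math.dist(positions[u], positions[v]) for u, v in edges)
    k = max(1, math.ceil(tail_fraction * len(lengths)))
    longest = lengths[-k:]
    return {
        "mean": statistics.fmean(lengths),
        "variance": statistics.pvariance(lengths),
        "longest_mean": statistics.fmean(longest),
        "max": lengths[-1],
    }


# Toy graph: three nodes forming a 3-4-5 right triangle.
positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (3.0, 0.0)}
edges = [("a", "b"), ("a", "c"), ("b", "c")]
summary = edge_length_summary(positions, edges)
```

A model with a high mean, a low variance, and a heavy tail of long edges would, under the reading above, be the more discontinuous one.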

5 Concluding Remarks

The contextual model described in this paper offers a number of interesting and valuable features in comparison with the global and hierarchical models traditionally used to represent a given collection of documents. Further, a number of improvements were proposed for applying the immune algorithm to clustering a collection of documents. These improvements include:

- Identification of redundant antibodies by means of the fast agglomerative clustering algorithm [13].

- Fast generation of mutated clones without computing their stimulation by the currently presented antigen. These mutants can be characterized by a presumed ability to generalize (cf. section 3.2).

- Time-dependent parameters σ_d and σ_s. In general, we have no recipe for tuning both parameters to a given dataset. In the original approach [7] a trial-and-error method was suggested. We observed that in a high-dimensional space the value of σ_d is almost as critical as the value of σ_s. Hence we propose a "consistent" tuning of these parameters (cf. section 3.3). The general recipe is: carefully (i.e. not too fast) remove weakly stimulated and too specific antibodies, and carefully splice redundant (too similar) antibodies.

- Application of CF-trees [18] for fast identification of winners (most stimulated memory cells) [6].
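The general recipe for the two time-dependent thresholds can be illustrated with a small sketch. It is not the procedure of section 3.3, which is not reproduced here. It assumes that sigma_d is a minimum stimulation level below which an antibody is removed, that sigma_s is a cosine-similarity level above which two antibodies are treated as redundant and averaged, and that both thresholds change linearly over the iterations. The names `linear_schedule`, `cosine` and `prune_and_merge`, the averaging rule, and all numeric values are hypothetical.

```python
import math


def linear_schedule(start, end, step, total_steps):
    """Move a threshold gradually from start to end (assumed linear form)."""
    if total_steps <= 1:
        return end
    frac = min(1.0, max(0.0, step / (total_steps - 1)))
    return start + frac * (end - start)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_and_merge(antibodies, stimulation, sigma_d, sigma_s):
    """Remove weakly stimulated antibodies, then merge redundant ones.

    Assumption: an antibody is dropped when its stimulation is below sigma_d,
    and merged into an earlier survivor when their similarity is >= sigma_s.
    """
    survivors = [ab for ab, s in zip(antibodies, stimulation) if s >= sigma_d]
    merged = []  # entries are [representative vector, number of merged members]
    for ab in survivors:
        for entry in merged:
            if cosine(entry[0], ab) >= sigma_s:
                n = entry[1]
                # Running mean of all members merged into this representative.
                entry[0] = [(n * x + y) / (n + 1) for x, y in zip(entry[0], ab)]
                entry[1] = n + 1
                break
        else:
            merged.append([list(ab), 1])
    return [vec for vec, _ in merged]


# Toy data: four antibodies in a 2-D term space with made-up stimulation levels.
antibodies = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
stimulation = [0.8, 0.7, 0.6, 0.05]
total_steps = 3
for step in range(total_steps):
    # Tighten both thresholds slowly rather than applying final values at once.
    sigma_d = linear_schedule(0.0, 0.1, step, total_steps)
    sigma_s = linear_schedule(0.999, 0.95, step, total_steps)
    network = prune_and_merge(antibodies, stimulation, sigma_d, sigma_s)
```

Starting with permissive thresholds and tightening them over time is one way to read "not too fast": few antibodies are removed or merged in early iterations, when the network has not yet settled.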

With these improvements we propose a new approach to mining high-dimensional datasets. The contextual approach described in section 2 appears to be fast, of good quality (in terms of the indices introduced in sections 4.1 and 4.2), and scalable (with respect to data size and dimension).

Clustering high-dimensional data is of practical importance and, at the same time, a big challenge, in particular for large collections of text documents. The paper presents a novel approach, based on artificial immune systems, within the broad stream of map-type clustering methods. Such an approach leads to many interesting research issues, such as context-dependent dictionary reduction and keyword identification, topic-sensitive document summarization, subjective model visualization based on a particular user's information requirements, dynamic adaptation of the document representation, and local similarity measure computation. We plan to tackle these problems in our future work. It has to be stressed that not only textual data but any other high-dimensional data may be clustered using the presented method.
