At the extreme end, the distribution of the longest edges behaves similarly
for the contextual and the global models, but differently for the hierarchical
model, which contains many more very long edges. Recalling that for this model
the variance of the edge lengths was low and the average length was high, we
can conclude that the hierarchical model is generally more discontinuous. The
same is true for the SOM model, which is another indication of the imperfection
of the static grid topology.
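The reasoning above combines three summary statistics of the edge lengths: the mean, the variance, and the share of very long edges. A minimal sketch of how such a summary could be computed is given below. The function name `edge_length_summary`, the parameter `long_threshold`, and the two sample lists are hypothetical and serve only as an illustration; they are not the measurements reported in the paper.

```python
from statistics import mean, pvariance


def edge_length_summary(lengths, long_threshold):
    """Return mean, population variance, and share of edges above long_threshold.

    `long_threshold` is an illustrative cut-off for "very long" edges,
    not a value taken from the paper.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    long_share = sum(1 for x in lengths if x > long_threshold) / len(lengths)
    return {
        "mean": mean(lengths),
        "variance": pvariance(lengths),
        "long_share": long_share,
    }


# Hypothetical edge lengths: a mostly smooth map with one long edge,
# and a map whose edges are uniformly long.
smooth = [0.1, 0.2, 0.2, 0.3, 0.9]
uniformly_long = [0.7, 0.8, 0.8, 0.9, 0.9]

smooth_stats = edge_length_summary(smooth, long_threshold=0.6)
long_stats = edge_length_summary(uniformly_long, long_threshold=0.6)
```

In this toy data the second list has a higher mean, a lower variance and a larger share of long edges than the first, which is the pattern the text associates with a more discontinuous model.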
5 Concluding Remarks
The contextual model described in this paper has a number of interesting
and valuable features in comparison with the global and hierarchical models
traditionally used to represent a given collection of documents. Further, a
number of improvements were proposed for applying the immune algorithm to
clustering the collection of documents. These improvements include:
- Identification of redundant antibodies by means of the fast agglomerative
clustering algorithm [13].
- Fast generation of mutated clones without computing their stimulation
by the currently presented antigen. These mutants can be characterized by
a presumed ability of generalization (cf. section 3.2).
- Time-dependent parameters σ_d and σ_s. In general we have no recipe for
tuning both parameters to a given dataset. In the original approach [7]
a trial-and-error method was suggested. We observed that in a high-dimensional
space the value of σ_d is almost as critical as the value of σ_s. Hence we
propose a "consistent" tuning of these parameters (cf. section 3.3). The
general recipe is: carefully (i.e. not too fast) remove weakly stimulated and
too specific antibodies, and carefully splice redundant (too similar) antibodies.
- Application of CF-trees [18] for fast identification of winners (most
stimulated memory cells) [6].
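The general recipe for σ_d and σ_s can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: the thresholds follow a linear schedule, similarity is measured by the cosine, and redundant antibodies are dropped in a simple greedy pass rather than spliced by the fast agglomerative clustering algorithm [13]. Removal of "too specific" antibodies is not modelled. All function names are hypothetical.

```python
import math


def linear_schedule(start, end, step, total_steps):
    """Move a threshold gradually from `start` to `end` (assumed linear)."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune(antibodies, stimulation, sigma_d, sigma_s):
    """Drop weakly stimulated antibodies, then drop redundant ones.

    An antibody is kept only if its stimulation is at least sigma_d and
    its cosine similarity to every already kept antibody is at most sigma_s.
    """
    survivors = [ab for ab, s in zip(antibodies, stimulation) if s >= sigma_d]
    kept = []
    for ab in survivors:
        if all(cosine(ab, k) <= sigma_s for k in kept):
            kept.append(ab)
    return kept


# Hypothetical antibodies (2-D for readability) and stimulation levels.
antibodies = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]]
stimulation = [0.9, 0.8, 0.7, 0.1]

total_steps = 5
for step in range(total_steps):
    # Thresholds tighten slowly, so early iterations remove almost nothing.
    sigma_d = linear_schedule(0.0, 0.2, step, total_steps)
    sigma_s = linear_schedule(1.0, 0.95, step, total_steps)
    population = prune(antibodies, stimulation, sigma_d, sigma_s)
```

Starting from permissive values and tightening both thresholds together is one way to read the "not too fast" advice; other schedules would fit the same description.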
With these improvements we proposed a new approach to mining high-dimensional
datasets. The contextual approach described in section 2 appears to be
fast, of good quality (in terms of the indices introduced in sections 4.1 and
4.2) and scalable (with the data size and dimension).
Clustering high-dimensional data is of practical importance and, at the same
time, a big challenge, in particular for large collections of text documents.
The paper presents a novel approach, based on artificial immune systems,
within the broad stream of map-type clustering methods. Such an approach
leads to many interesting research issues, such as context-dependent dictionary
reduction and keyword identification, topic-sensitive document summarization,
subjective model visualization based on a particular user's information
requirements, dynamic adaptation of the document representation, and local
similarity measure computation. We plan to tackle these problems in our future
work. It has to be stressed that not only textual data but also any other
high-dimensional data may be clustered using the presented method.