An Immune Network
for Contextual Text Data Clustering
Krzysztof Ciesielski, Slawomir T. Wierzchon, and Mieczyslaw A. Klopotek
Institute of Computer Science, Polish Academy of Sciences,
ul. Ordona 21, 01-237 Warszawa, Poland
{kciesiel, stw, klopotek}@ipipan.waw.pl
Abstract. We present a novel approach to the incremental creation of document
maps, which relies on partitioning a given collection of documents
into a hierarchy of homogeneous groups of documents represented by
different sets of terms. Further, each group (which in fact defines a separate
context) is explored by a modified version of the aiNet immune algorithm
to extract its inner structure. The immune cells produced by the algorithm
become reference vectors used in the preparation of the final document
map. Such an approach proves to be robust in terms of time and space
requirements as well as the quality of the resulting clustering model.
1 Introduction
An analysis of the number of terms per query in one billion accesses to the
Altavista site, [10], showed that in 20.6% of queries no term was entered; one
quarter of queries used just one term, and the average was not much higher than
two terms! This justifies our interest in looking for more "user-friendly"
interfaces to web-browsers.
According to the so-called Cluster Hypothesis, [16], relevant documents tend to be
highly similar to each other, and therefore tend to appear in the same clusters.
Thus, it is possible to reduce the number of documents that need to be compared
to a given query, as it suffices to match the query against cluster representatives
first. However, such an approach offers only a technical improvement in searching
for relevant documents. A more radical improvement can be gained by using so-called
document maps, [2], where a graphical representation additionally conveys
information about the relationships between individual documents or groups
of documents. Document maps are primarily oriented towards visualizing
similarity within a collection of documents, although other uses of such
maps are possible; consult Chapter 5 in [2] for details.
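The retrieval shortcut implied by the Cluster Hypothesis (match the query against cluster representatives first, then only against documents in the best-matching clusters) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: documents are toy term-weight vectors, each cluster is represented by its centroid, similarity is the cosine measure, and the names `cosine`, `centroid`, and `cluster_search` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors (assumed cluster representative)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_search(query, clusters, top_clusters=1):
    """Rank only the documents of the clusters whose representative best matches the query."""
    # Step 1: compare the query with one representative per cluster.
    reps = {name: centroid(list(docs.values())) for name, docs in clusters.items()}
    best = sorted(reps, key=lambda name: cosine(query, reps[name]), reverse=True)
    best = best[:top_clusters]
    # Step 2: compare the query only with documents of the selected clusters.
    candidates = [
        (doc_id, cosine(query, vec))
        for name in best
        for doc_id, vec in clusters[name].items()
    ]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Toy collection; the assumed vocabulary order is ["immune", "network", "soccer", "goal"].
clusters = {
    "biology": {"d1": [2, 1, 0, 0], "d2": [1, 2, 0, 0]},
    "sport": {"d3": [0, 0, 2, 1], "d4": [0, 1, 1, 2]},
}
query = [1, 0, 0, 0]  # a one-term query: "immune"
results = cluster_search(query, clusters)
print(results)
```

With this toy data the query is compared with two representatives and then with the two documents of the "biology" cluster, rather than with all four documents; on a realistic collection the saving grows with the number of documents per cluster.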
The most prominent representative of this direction is the WEBSOM project.
Here the Self-Organizing Map (SOM, [14]) algorithm is used to organize
miscellaneous text documents onto a 2-dimensional grid so that related documents
appear close to each other. Each grid unit contains a set of closely related
documents. The color intensity reflects dissimilarity among neighboring units: the
lighter the shade, the more similar the neighboring units are. Unfortunately this approach