An Immune Network for Contextual Text Data Clustering - Artificial Immune Systems

An Immune Network for Contextual Text Data Clustering

Krzysztof Ciesielski, Slawomir T. Wierzchon, and Mieczyslaw A. Klopotek

Institute of Computer Science, Polish Academy of Sciences,

ul. Ordona 21, 01-237 Warszawa, Poland

{kciesiel, stw, klopotek}@ipipan.waw.pl

Abstract. We present a novel approach to the incremental creation of document maps, which relies upon partitioning a given collection of documents into a hierarchy of homogeneous groups of documents represented by different sets of terms. Further, each group (in fact defining a separate context) is explored by a modified version of the aiNet immune algorithm to extract its inner structure. The immune cells produced by the algorithm become reference vectors used in the preparation of the final document map. Such an approach proves to be robust in terms of time and space requirements as well as the quality of the resulting clustering model.

1 Introduction

An analysis of the number of terms per query in one billion accesses to the Altavista site [10] showed that in 20.6% of queries no term was entered; one quarter used just one term in a search, and the average was not much higher than two terms! This justifies our interest in looking for more "user-friendly" interfaces to web browsers.

According to the so-called Cluster Hypothesis [16], relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. Thus, it is possible to reduce the number of documents that need to be compared to a given query, as it suffices to match the query against cluster representatives first. However, such an approach offers only a technical improvement in searching for relevant documents. A more radical improvement can be gained by using so-called document maps [2], where a graphical representation additionally conveys information about the relationships between individual documents or groups of documents. Document maps are primarily oriented towards visualizing the similarity structure of a collection of documents, although other uses of such maps are possible; consult Chapter 5 in [2] for details.
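The query-matching shortcut implied by the Cluster Hypothesis can be illustrated with a minimal sketch. It assumes, purely for illustration, that documents are term-frequency vectors, that each cluster is represented by its centroid, and that cosine similarity is the matching measure; none of these choices, nor the names `cluster_search` and `CLUSTERS`, come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean, used here as the (assumed) cluster representative."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_search(query, clusters, top_k=3):
    """Match the query against cluster representatives first,
    then rank only the documents of the best-matching cluster."""
    representatives = {
        name: centroid([vec for _, vec in docs]) for name, docs in clusters.items()
    }
    best = max(representatives, key=lambda name: cosine(query, representatives[name]))
    ranked = sorted(clusters[best], key=lambda doc: cosine(query, doc[1]), reverse=True)
    return best, [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical toy collection; term order: immune, network, map, football.
CLUSTERS = {
    "ais": [("d1", [2, 1, 0, 0]), ("d2", [1, 2, 0, 0])],
    "sport": [("d3", [0, 0, 0, 3]), ("d4", [0, 0, 1, 2])],
}

if __name__ == "__main__":
    print(cluster_search([1, 0, 0, 0], CLUSTERS))
```

With this toy data, the query is compared against two representatives instead of four documents, and only the documents of the winning cluster are ranked afterwards.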

The most prominent representative of this direction is the WEBSOM project. Here the Self-Organizing Map (SOM) algorithm [14] is used to organize miscellaneous text documents onto a 2-dimensional grid so that related documents appear close to each other. Each grid unit contains a set of closely related documents. The color intensity reflects dissimilarity among neighboring units: the lighter the shade, the more similar the neighboring units are. Unfortunately this approach
