An Immune Network for Contextual Text Data Clustering - Artificial Immune Systems

is time and space consuming, and raises questions of scaling and updating of document maps (although some improvements are reported in [15]). To overcome some of these problems the DocMINER system was proposed in [2].

In our research project BEATCA [13], oriented towards exploration and navigation in large collections of documents, a fully-fledged search engine capable of representing on-line replies to queries in graphical form on a document map has been designed and constructed [12]. A number of machine-learning techniques, such as a fast algorithm for Bayesian network construction [13], SVD analysis, Growing Neural Gas (GNG) [9], the SOM algorithm, etc., have been employed to realize the project. BEATCA extends the main goals of WEBSOM with a multilingual approach and new forms of geometrical representation (besides rectangular maps, projections onto sphere and torus surfaces are possible); further, we experimented with various modifications of the entire clustering process by using the SOM, GNG and immune algorithms.

In this paper we focus on some problems concerning the application of an immune algorithm to extract clustering structure. In section 2 we present our hierarchical, topic-sensitive approach, which appears to be a robust solution to the problem of scalability of the map generation process (both in terms of time complexity and space requirements). It relies upon extraction of a hierarchy of concepts, i.e. almost homogeneous groups of documents described by unique sets of terms. To represent the content of each context, a modified version of the aiNet [7] algorithm was employed (see section 3). This algorithm was chosen because of its ability to represent internal patterns existing in a training set. To evaluate the effectiveness of the novel text clustering procedure, it has been compared to the aiNet and SOM algorithms in section 4. In the experimental sections 4.5-4.7 we have also investigated issues such as evaluation of immune network structure and the influence of the chosen antibody/antigen representation on the resulting immune memory model. Final conclusions are given in section 5.

2 Contextual Local Networks

In our approach, as in many traditional IR systems, documents are mapped into an $m$-dimensional term vector space. The points (documents) in this space are of the form $(w_{1,d}, \ldots, w_{m,d})$, where $m$ stands for the number of terms and each $w_{t,d}$ is a weight for term $t$ in document $d$, the so-called term frequency/inverse document frequency (tfidf) weight:

$$w_{t,d} = w(t,d) = f_{td} \cdot \log\frac{N}{f_t} \qquad (1)$$

where $f_{td}$ is the number of occurrences of term $t$ in document $d$, $f_t$ is the number of documents containing term $t$, and $N$ is the total number of documents.
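
As an illustration only, the weighting in Eq. (1) can be sketched in a few lines of Python. The function name `tfidf_vectors`, the toy documents, and the choice of the natural logarithm are assumptions made for this sketch; the text above does not fix the base of the logarithm, and this is not the BEATCA implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to tfidf vectors, w[t,d] = f_td * log(N / f_t).

    `docs` is a list of token lists. The natural logarithm is an assumption.
    """
    n_docs = len(docs)                              # N: total number of documents
    term_counts = [Counter(doc) for doc in docs]    # f_td: occurrences of t in d
    doc_freq = Counter()                            # f_t: documents containing t
    for counts in term_counts:
        doc_freq.update(counts.keys())
    terms = sorted(doc_freq)                        # fixed order of the m dimensions
    vectors = [
        [counts[t] * math.log(n_docs / doc_freq[t]) for t in terms]
        for counts in term_counts
    ]
    return terms, vectors

# Hypothetical toy collection of three already-tokenized documents.
docs = [
    ["immune", "network", "clustering"],
    ["immune", "immune", "map"],
    ["map", "document", "clustering"],
]
terms, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one point $(w_{1,d}, \ldots, w_{m,d})$ in the term space, with dimensions ordered as in `terms`. Under this formula a term that occurs in every document receives weight zero, since $\log(N/f_t) = \log 1 = 0$.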

The vector space model has been criticized for some disadvantages, among others polysemy and synonymy [3]. To overcome these disadvantages, a contextual approach has been proposed, relying upon dividing the set of documents into a number of homogeneous and disjoint subgroups, each of which is described by
