is time and space consuming, and raises questions of scaling and updating of
document maps (although some improvements are reported in [15]). To overcome
some of these problems the DocMINER system was proposed in [2].
In our research project BEATCA [13], oriented towards exploration and navigation
in large collections of documents, a fully-fledged search engine capable of
representing on-line replies to queries in graphical form on a document map has
been designed and constructed [12]. A number of machine-learning techniques,
such as a fast algorithm for Bayesian network construction [13], SVD analysis,
Growing Neural Gas (GNG) [9], the SOM algorithm, etc., have been employed to
realize the project. BEATCA extends the main goals of WEBSOM with a multilingual
approach and new forms of geometrical representation (besides rectangular maps,
projections onto sphere and torus surfaces are possible); furthermore, we
experimented with various modifications of the entire clustering process by
using the SOM, GNG and immune algorithms.
In this paper we focus on some problems concerning the application of an immune
algorithm to extract clustering structure. In section 2 we present our
hierarchical, topic-sensitive approach, which appears to be a robust solution to
the problem of scalability of the map generation process (both in terms of time
complexity and space requirements). It relies upon extraction of a hierarchy of
concepts, i.e. almost homogeneous groups of documents described by unique sets
of terms. To represent the content of each context, a modified version of the
aiNet [7] algorithm was employed; see section 3. This algorithm was chosen
because of its ability to represent internal patterns existing in a training
set. To evaluate the effectiveness of the novel text clustering procedure, it
has been compared to the aiNet and SOM algorithms in section 4. In the
experimental sections 4.5-4.7 we have also investigated issues such as
evaluation of the immune network structure and the influence of the chosen
antibody/antigen representation on the resulting immune memory model. Final
conclusions are given in section 5.
2 Contextual Local Networks
In our approach, as in many traditional IR systems, documents are mapped
into an $m$-dimensional term vector space. The points (documents) in this space
are of the form $(w_{1,d}, \ldots, w_{m,d})$, where $m$ stands for the number of
terms, and each $w_{t,d}$ is a weight for term $t$ in document $d$, the
so-called term frequency/inverse document frequency (tfidf) weight:
$$w_{t,d} = w(t,d) = f_{td} \cdot \log\frac{N}{f_t} \qquad (1)$$
where $f_{td}$ is the number of occurrences of term $t$ in document $d$, $f_t$
is the number of documents containing term $t$, and $N$ is the total number of
documents.
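The weighting in equation (1) can be illustrated with a minimal Python sketch. It is not taken from the BEATCA implementation: the function name `tfidf_weights` and the token-list input format are hypothetical, and the natural logarithm is assumed because the text does not state the base.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute w(t, d) = f_td * log(N / f_t) for every term in every document.

    `docs` is a list of token lists (a hypothetical input format).
    Returns one dict per document mapping term -> weight.
    """
    n_docs = len(docs)  # N: total number of documents

    # f_t: number of documents containing term t (each document counted once)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)  # f_td: occurrences of t in this document
        weights.append({
            term: f_td * math.log(n_docs / doc_freq[term])  # natural log assumed
            for term, f_td in term_freq.items()
        })
    return weights


if __name__ == "__main__":
    sample = [
        ["immune", "network", "immune"],
        ["network", "map"],
        ["map", "som"],
    ]
    for d, w in enumerate(tfidf_weights(sample)):
        print(d, w)
```

Under this formula a term that occurs in every document receives weight zero, since $\log(N/f_t) = \log 1 = 0$, so such terms do not help to distinguish documents.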
The vector space model has been criticized for several disadvantages, polysemy
and synonymy among others [3]. To overcome these disadvantages a contextual
approach has been proposed, relying upon dividing the set of documents into
a number of homogeneous and disjoint subgroups, each of which is described by
 