An Immune Network for Contextual Text Data Clustering - Artificial Immune Systems


unique subset of terms. In the sequel we will distinguish between the hierarchical and the contextual model. In the former, the set of terms, with tfidf weights (eq. (1)), is identical for each subgroup of documents, while in the latter each subgroup is represented by a different subset of terms weighted in accordance with equation (3). Finally, when we do not split the entire set of documents and instead construct a single, "flat" representation for the whole collection, we will refer to the global model.

The contextual approach consists of two main stages. In the first stage a hierarchical model is built, i.e. a collection D of documents is recursively divided, using the Fuzzy ISODATA algorithm [4], into homogeneous groups consisting of approximately the same number of elements. Such a procedure results in a hierarchy represented by a tree of clusters. The process of partitioning halts when the number of documents inside each group meets predefined criteria¹. To compute the distance $dist(d, c)$ of a document $d$ from a centroid $c$, the following function was used: $dist(d, c) = 1 - \langle d/\|d\|, c/\|c\| \rangle$, where the symbol $\langle \cdot, \cdot \rangle$ stands for the dot-product of two vectors. Given $m_{dG}$, the degree of membership of a document $d$ to a group $G$, this document is assigned to the group with the highest value of $m_{dG}$.

The second phase of contextual document processing is the division of the term space (dictionary) into possibly overlapping subspaces of terms specific to each context (i.e. the group extracted in the previous stage). The fuzzy membership level, $m_{tG}$, representing the importance of a particular term $t$ in a given context $G$, is computed as:

$$m_{tG} = \frac{\sum_{d \in G} (f_{td} \cdot m_{dG})}{f_G \cdot \sum_{d \in G} m_{dG}} \tag{2}$$

where $f_G$ is the number of documents in the cluster $G$, $m_{dG}$ is the degree of membership of document $d$ to group $G$, and $f_{td}$ is the number of occurrences of term $t$ in document $d$. We assume that a term $t$ is relevant for a given context $G$ if $m_{tG} > \epsilon$, where $\epsilon$ is a parameter.
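As a rough illustration of the distance function, the group assignment rule and equation (2), the following Python sketch works on one group of documents. The data layout (dicts keyed by document and term identifiers), all function names, the threshold name `eps`, and the handling of zero vectors are assumptions made for this example; they are not taken from the original system.

```python
import math

def dist(d, c):
    """Distance 1 - <d/||d||, c/||c||> between two dense vectors."""
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_c = math.sqrt(sum(x * x for x in c))
    if norm_d == 0.0 or norm_c == 0.0:
        return 1.0  # assumption: a zero vector is treated as maximally distant
    dot = sum(x * y for x, y in zip(d, c))
    return 1.0 - dot / (norm_d * norm_c)

def assign_group(memberships):
    """Pick the group with the highest membership degree m_dG.

    memberships: dict mapping group id -> m_dG for a single document.
    """
    return max(memberships, key=memberships.get)

def term_membership(term_counts, doc_membership):
    """Equation (2): m_tG for every term occurring in group G.

    term_counts: dict doc id -> dict term -> f_td (documents of G only).
    doc_membership: dict doc id -> m_dG for the same documents.
    Assumes the memberships of the group sum to a positive value.
    """
    f_G = len(term_counts)  # number of documents in G
    total_membership = sum(doc_membership[d] for d in term_counts)
    weighted = {}
    for d, counts in term_counts.items():
        for t, f_td in counts.items():
            weighted[t] = weighted.get(t, 0.0) + f_td * doc_membership[d]
    return {t: s / (f_G * total_membership) for t, s in weighted.items()}

def relevant_terms(m_tG, eps):
    """Terms whose membership level exceeds the threshold eps."""
    return {t for t, m in m_tG.items() if m > eps}

# Toy group with two documents (made-up counts and memberships).
counts = {"d1": {"immune": 4, "network": 1}, "d2": {"immune": 2}}
member = {"d1": 0.9, "d2": 0.6}
m = term_membership(counts, member)
kept = relevant_terms(m, eps=0.5)
```

With these toy values only the term "immune" passes the threshold, so the context would keep a one-term dictionary. In practice the value of the threshold controls how strongly each context's term space is reduced.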

Removing non-relevant terms leads to a topic-sensitive reduction of the dimension of the term space. This reduction results in a new vector representation of documents; each component of the vector is computed according to the equation:

$$w_{tdG} = f_{td} \cdot m_{tG} \cdot \log\frac{f_G}{f_t \cdot m_{tG}} \tag{3}$$

where $f_t$ is the number of documents in the group $G$ containing term $t$.
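A minimal sketch of equation (3) in Python is given below. The function and argument names are hypothetical, the natural logarithm is assumed because the text does not state the base, and returning zero for absent or non-relevant terms is likewise an assumption of this example.

```python
import math

def contextual_weight(f_td, m_tG, f_G, f_t):
    """Equation (3): w_tdG = f_td * m_tG * log(f_G / (f_t * m_tG))."""
    if f_td == 0 or m_tG <= 0.0 or f_t == 0:
        return 0.0  # assumption: absent or non-relevant terms get zero weight
    return f_td * m_tG * math.log(f_G / (f_t * m_tG))

def document_vector(counts, m_tG, f_G, f_t, eps):
    """Weights of one document over the terms relevant to context G.

    counts: dict term -> f_td for this document.
    m_tG:   dict term -> membership level of the term in G.
    f_t:    dict term -> number of documents of G containing the term.
    eps:    relevance threshold (hypothetical parameter name).
    """
    return {
        t: contextual_weight(counts.get(t, 0), m, f_G, f_t[t])
        for t, m in m_tG.items()
        if m > eps
    }

# Toy values (made up): a group of 10 documents, the term occurs in 4 of them.
w = contextual_weight(f_td=3, m_tG=0.5, f_G=10, f_t=4)
vec = document_vector(
    counts={"immune": 3},
    m_tG={"immune": 0.5, "rare": 0.01},
    f_G=10,
    f_t={"immune": 4, "rare": 1},
    eps=0.1,
)
```

Here `vec` contains a single component, because the term "rare" falls below the threshold and is dropped from the contextual dictionary before any weight is computed.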

To depict the similarity relation between contexts (represented by a set of contextual models), an additional "global" map is required. Such a model becomes the root of the hierarchy of contextual maps. The main map is created in a manner similar to the previously created maps, with one distinction: an example in the training data is a weighted centroid of the referential vectors of the corresponding contextual model: $x_i = \sum_{c \in M_i} (d_c \cdot v_c)$, where $M_i$ is the set of cells in the $i$-th contextual model, $d_c$ is the density of the cell and $v_c$ is its referential vector.
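The weighted centroid used as a training example for the main map can be sketched as follows. The representation of a cell as a (density, vector) pair and the function name are assumptions of this example. The sketch also assumes that all referential vectors are expressed over a common term space, and it uses the densities as given, since the text does not say whether they are normalised.

```python
def weighted_centroid(cells):
    """x_i = sum over cells c in M_i of d_c * v_c.

    cells: list of (density d_c, referential vector v_c) pairs.
    """
    if not cells:
        raise ValueError("a contextual model needs at least one cell")
    dim = len(cells[0][1])
    x = [0.0] * dim
    for density, vector in cells:
        if len(vector) != dim:
            raise ValueError("all referential vectors must share one dimension")
        for k, v in enumerate(vector):
            x[k] += density * v
    return x

# Toy contextual model with two cells (made-up densities and vectors).
cells = [(0.75, [1.0, 0.0]), (0.25, [0.0, 2.0])]
x = weighted_centroid(cells)
```

Each contextual model contributes exactly one such vector, so the main map is trained on as many examples as there are contexts.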

¹ Currently a single criterion is used: the cardinality $c_i$ of the $i$-th cluster must stay within given bounds $[c_{min}, c_{max}]$. This way the maps created for each group at the same level of a given hierarchy will contain a similar number of documents.
