unique subset of terms. In the sequel we will distinguish between the hierarchical
and the contextual model. In the former, the set of terms, with tfidf weights
(eq. (1)), is identical for each subgroup of documents, while in the latter each
subgroup is represented by a different subset of terms weighted in accordance with
equation (3). Finally, when we do not split the entire set of documents and instead
construct a single, "flat" representation for the whole collection, we will refer
to the global model.
The contextual approach consists of two main stages. In the first stage a
hierarchical model is built, i.e. a collection $D$ of documents is recursively
divided, using the Fuzzy ISODATA algorithm [4], into homogeneous groups consisting
of approximately equal numbers of elements. Such a procedure results in a
hierarchy represented by a tree of clusters. The process of partitioning halts
when the number of documents inside each group meets predefined criteria (see
footnote 1). To compute the distance $dist(d, c)$ of a document $d$ from a
centroid $c$, the following function was used:

\[ dist(d, c) = 1 - \left\langle \frac{d}{\|d\|}, \frac{c}{\|c\|} \right\rangle \]

where the symbol $\langle \cdot, \cdot \rangle$ stands for the dot-product of two
vectors. Given $m_{dG}$, the degree of membership of a document $d$ to a group
$G$, this document is assigned to the group with the highest value of $m_{dG}$.
The second phase of contextual document processing is the division of the terms
space (dictionary) into possibly overlapping subspaces of terms specific to each
context (i.e. the group extracted in the previous stage). The fuzzy membership
level $m_{tG}$, representing the importance of a particular term $t$ in a given
context $G$, is computed as:

\[ m_{tG} = \frac{\sum_{d \in G} (f_{td} \cdot m_{dG})}{f_G \cdot \sum_{d \in G} m_{dG}} \tag{2} \]
where $f_G$ is the number of documents in the cluster $G$, $m_{dG}$ is the degree
of membership of document $d$ to group $G$, and $f_{td}$ is the number of
occurrences of term $t$ in document $d$. We assume that a term $t$ is relevant for
a given context $G$ if $m_{tG}$ exceeds a threshold, where the threshold is a
parameter.
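The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the plain-list vector representation and the `threshold` argument are hypothetical, the distance is read as one minus the dot product of the normalized vectors, and the Fuzzy ISODATA step that produces the membership degrees is taken as given.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(d: Sequence[float], c: Sequence[float]) -> float:
    """dist(d, c) = 1 - <d/||d||, c/||c||>; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(d, c))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_c = math.sqrt(sum(y * y for y in c))
    return 1.0 - dot / (norm_d * norm_c)


def assign_group(memberships: Dict[str, float]) -> str:
    """Return the group G with the highest membership degree m_dG."""
    return max(memberships, key=memberships.get)


def term_membership(f_td: Sequence[float], m_dG: Sequence[float]) -> float:
    """Equation (2) for one term t; both inputs list the documents d in G.

    Assumption: f_G is taken as the number of documents passed in.
    """
    f_G = len(f_td)
    numerator = sum(f * m for f, m in zip(f_td, m_dG))
    return numerator / (f_G * sum(m_dG))


def relevant_terms(m_tG: Dict[str, float], threshold: float) -> List[str]:
    """Keep the terms whose membership level exceeds the threshold parameter."""
    return [t for t, m in m_tG.items() if m > threshold]


# Hypothetical toy data: three documents in G, occurrence counts of two terms.
doc_memberships = [0.9, 0.6, 0.5]
levels = {
    "fuzzy": term_membership([2, 0, 1], doc_memberships),
    "the": term_membership([0, 1, 0], doc_memberships),
}
kept = relevant_terms(levels, threshold=0.25)
```

Terms that fall below the threshold are simply dropped from the dictionary of the context, which is what reduces the dimension of the terms space for that group.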
Removing non-relevant terms leads to a topic-sensitive reduction of the
dimension of the terms space. This reduction results in a new vector
representation of documents; each component of the vector is computed according
to the equation:

\[ w_{tdG} = f_{td} \cdot m_{tG} \cdot \log \frac{f_G}{f_t \cdot m_{tG}} \tag{3} \]

where $f_t$ is the number of documents in the group $G$ containing term $t$.
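As a worked illustration, the contextual weight can be computed directly once equation (3) is read as w_tdG = f_td · m_tG · log(f_G / (f_t · m_tG)). The sketch below assumes that reading. The function name, the use of the natural logarithm (the text does not state a base) and the zero-weight fallback for degenerate inputs are assumptions made for the example.

```python
import math


def contextual_weight(f_td: float, m_tG: float, f_G: int, f_t: int) -> float:
    """w_tdG = f_td * m_tG * log(f_G / (f_t * m_tG)), natural log assumed.

    f_td: occurrences of term t in document d
    m_tG: membership level of term t in context G
    f_G:  number of documents in G
    f_t:  number of documents in G that contain t
    """
    if f_t <= 0 or m_tG <= 0:
        # Assumption: a term absent from G, or with zero membership, gets weight 0.
        return 0.0
    return f_td * m_tG * math.log(f_G / (f_t * m_tG))


# Hypothetical values: term occurs twice in d, appears in 4 of the 10 documents of G.
w = contextual_weight(f_td=2, m_tG=0.5, f_G=10, f_t=4)
```

Note that the logarithm becomes negative when f_t · m_tG exceeds f_G; the sketch does not clip such values.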
To depict the similarity relation between contexts (represented by a set of
contextual models), an additional "global" map is required. Such a model becomes
the root of the hierarchy of contextual maps. The main map is created in a manner
similar to the previously created maps, with one distinction: an example in the
training data is a weighted centroid of the referential vectors of the
corresponding contextual model:

\[ x_i = \sum_{c \in M_i} (d_c \cdot v_c) \]

where $M_i$ is the set of cells in the $i$-th contextual model, $d_c$ is the
density of the cell $c$, and $v_c$ is its referential vector.
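A training example for the main map can be sketched as follows. The code implements the sum exactly as printed, so the result is a true centroid only if the cell densities of a model sum to one, which is an assumption here. The `Cell` alias and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation of a map cell: (density d_c, referential vector v_c).
Cell = Tuple[float, Sequence[float]]


def context_training_vector(cells: Sequence[Cell]) -> List[float]:
    """x_i = sum over cells c in M_i of d_c * v_c."""
    if not cells:
        raise ValueError("a contextual model needs at least one cell")
    dim = len(cells[0][1])
    x = [0.0] * dim
    for density, vector in cells:
        if len(vector) != dim:
            raise ValueError("all referential vectors must share one dimension")
        for k, value in enumerate(vector):
            x[k] += density * value
    return x


# Hypothetical contextual model with two cells whose densities sum to 1.
x_i = context_training_vector([(0.25, [1.0, 0.0]), (0.75, [0.0, 2.0])])
```

Each contextual model contributes one such vector, so the main map is trained on as many examples as there are contexts.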
1 Currently a single criterion is used: the cardinality $c_i$ of the $i$-th
cluster must lie within given bounds $[c_{min}, c_{max}]$. This way the maps
created for each group at the same level of a given hierarchy will contain a
similar number of documents.
 