summaries. The average length of such an abstract is 132 words. In order to process numbers and symbols as well, it was decided to discard the words that are too rare (fewer than 50 occurrences) and to discard 1,355 words that are semantically poor. Eventually, a set of 43,222 words was retained for the whole corpus.
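As an illustration, the following minimal sketch performs that kind of vocabulary filtering; the corpus representation (a list of token lists), the stoplist of semantically poor words, and the threshold parameter are assumptions made for the example, not details of the Websom implementation.

```python
from collections import Counter

def build_vocabulary(corpus, stoplist, min_count=50):
    """Keep the words that occur at least min_count times in the corpus
    and that are not listed among the semantically poor words."""
    counts = Counter(word for document in corpus for word in document)
    return {word for word, count in counts.items()
            if count >= min_count and word not in stoplist}
```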
Several versions of the system exist. The earliest one coded the text histogram directly as a vector whose length was the number of words in the corpus vocabulary. With that coding, each component of the vector representing a text is the weighted occurrence frequency of the associated word in the text, the weights being set according to the influence of the word on the overall meaning of the document. That dimension was too large to allow further processing. Several data compression methods were proposed to cope with this high-dimension problem: projection reduction (principal component analysis) or random projection. Eventually, a random projection method was implemented: each text is represented by a 500-dimensional vector.
Such a vector summarizes the text, as derived from a statistical analysis of its vocabulary. The coding complexity is O(NL) + O(n), where N is the number of documents, L is the average number of distinct words in a document, and n is the initial histogram dimension. To appreciate the reduction, it is worth pointing out that the simpler projection compression method has a complexity on the order of NLd, where d is the reduced dimension (here 500). The reduction is thus substantial, and it makes it possible to extend Websom to the whole corpus.
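The following sketch illustrates how such a coding can reach the stated complexity: one random d-dimensional vector is drawn once for each of the n vocabulary words (the O(n) term), and each document is then encoded by summing the vectors of its distinct words, weighted by their occurrence frequencies (the O(NL) term overall). The use of raw frequencies as weights and of Gaussian random directions are simplifying assumptions; Websom's actual weights reflect each word's influence on the document's meaning.

```python
import numpy as np
from collections import Counter

def make_projection(vocabulary, d=500, seed=0):
    """Draw one fixed random d-dimensional direction per vocabulary word:
    the O(n) setup cost of the coding."""
    rng = np.random.default_rng(seed)
    return {word: rng.standard_normal(d) for word in vocabulary}

def encode(document, projection, d=500):
    """Encode one document in O(L): sum the random directions of its
    distinct words, weighted by occurrence frequency, then normalize."""
    vector = np.zeros(d)
    for word, frequency in Counter(document).items():
        if word in projection:
            vector += frequency * projection[word]
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
```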
7.5.3.2 Specific Features of the Learning Process
The two-dimensional map provides a visual representation of the organization of the corpus, which is a great help for document retrieval. At the end of the learning phase, the allocation phase, which associates a neuron with each document, makes it possible to locate a given document with respect to the global corpus: texts with similar meanings are expected to lie in nearby zones of the map. In the last version of Websom, the corpus is divided into 21 sections (agriculture, transportation, chemistry, electricity, etc.). To extract that information, each neuron is endowed with one of the section labels and with a set of keywords. The keywords are extracted from the subset of texts that are allocated to the neuron. More precisely, the section label is determined by a majority vote over that text subset, and the keywords are selected by taking the intersection of the keyword sets of all the texts in the subset.
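A minimal sketch of that labeling step follows; it assumes each allocated text already carries a section label and a keyword set, which the rest of the pipeline would have produced upstream.

```python
from collections import Counter

def label_neuron(texts):
    """texts: (section_label, keyword_set) pairs of the texts allocated
    to one neuron. Returns the majority section label and the keywords
    shared by every text of the subset."""
    if not texts:
        return None, set()
    majority = Counter(section for section, _ in texts).most_common(1)[0][0]
    shared_keywords = set.intersection(*(keywords for _, keywords in texts))
    return majority, shared_keywords
```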
When Websom is used, texts with close meanings are projected onto close regions of the two-dimensional map. Thus, projecting a text onto the map locates its meaning with respect to the whole set of texts of the training base, which is actually the whole corpus. The map labeling makes it possible to interpret a new text through an automatic process; the neighboring neurons provide additional information that allows a finer understanding.
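For illustration, locating a new text amounts to finding the neuron whose codebook vector is closest to the text's 500-dimensional code; the label and keywords of that neuron, and of its neighbors on the map, then provide the automatic interpretation. The codebook array below is a hypothetical stand-in for the trained map.

```python
import numpy as np

def best_matching_unit(text_vector, codebook):
    """codebook: array of shape (n_neurons, d) holding one weight vector
    per neuron; returns the index of the closest neuron."""
    distances = np.linalg.norm(codebook - text_vector, axis=1)
    return int(np.argmin(distances))
```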
Considering the very large number of documents in the base, a large number of neurons is required in order to perform a fine enough